2025-05-07T20:22:34.9521163Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9528211Z Runner name: 'i-03e120d7c73b3b069'
2025-05-07T20:22:34.9529130Z Machine name: 'ip-10-0-57-2'
2025-05-07T20:22:34.9531951Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9534314Z Contents: read
2025-05-07T20:22:34.9534820Z Metadata: read
2025-05-07T20:22:34.9535300Z Packages: read
2025-05-07T20:22:34.9535787Z ##[endgroup]
2025-05-07T20:22:34.9538071Z Secret source: None
2025-05-07T20:22:34.9538912Z Prepare workflow directory
2025-05-07T20:22:35.0047220Z Prepare all required actions
2025-05-07T20:22:35.0083108Z Getting action download info
2025-05-07T20:22:35.2490070Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5300029Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.9025957Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4602453Z Getting action download info
2025-05-07T20:22:37.5928112Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9008663Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:37.9540190Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9655441Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9667684Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9668354Z ##[endgroup]
2025-05-07T20:22:39.2045839Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2046265Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2046501Z AMI Name: unknown
2025-05-07T20:22:39.2087177Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.6626432Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.6626737Z with:
2025-05-07T20:22:44.6626958Z   submodules: true
2025-05-07T20:22:44.6627202Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.6627575Z   token: ***
2025-05-07T20:22:44.6627775Z   ssh-strict: true
2025-05-07T20:22:44.6627972Z   ssh-user: git
2025-05-07T20:22:44.6628195Z   persist-credentials: true
2025-05-07T20:22:44.6628436Z   clean: true
2025-05-07T20:22:44.6628660Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.6628924Z   fetch-depth: 1
2025-05-07T20:22:44.6629137Z   fetch-tags: false
2025-05-07T20:22:44.6629351Z   show-progress: true
2025-05-07T20:22:44.6629563Z   lfs: false
2025-05-07T20:22:44.6629769Z   set-safe-directory: true
2025-05-07T20:22:44.6630012Z env:
2025-05-07T20:22:44.6630222Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.6630519Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.6630779Z   BUILD_TARGET: genai
2025-05-07T20:22:44.6630994Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.6631261Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.6631508Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.6631737Z ##[endgroup]
2025-05-07T20:22:44.7782279Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.7783431Z ##[group]Getting Git version info
2025-05-07T20:22:44.7783843Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7784434Z [command]/usr/bin/git version
2025-05-07T20:22:44.7784691Z git version 2.47.1
2025-05-07T20:22:44.7798896Z ##[endgroup]
2025-05-07T20:22:44.7813420Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/a7fe0bb1-bbbf-46da-842d-f986fdd76615' before making global git config changes
2025-05-07T20:22:44.7814506Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.7829386Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7868573Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7872096Z ##[group]Initializing the repository
2025-05-07T20:22:44.7876278Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7917423Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.7918370Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.7919163Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.7919781Z hint:
2025-05-07T20:22:44.7920193Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.7920738Z hint:
2025-05-07T20:22:44.7921228Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.7922161Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.7922836Z hint:
2025-05-07T20:22:44.7923209Z hint:   git branch -m <name>
2025-05-07T20:22:44.7924085Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.7932651Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.7968530Z ##[endgroup]
2025-05-07T20:22:44.7969220Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.7973167Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8004828Z ##[endgroup]
2025-05-07T20:22:44.8005451Z ##[group]Setting up auth
2025-05-07T20:22:44.8012274Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8045924Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.8411236Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.8445699Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.8796306Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.8843910Z ##[endgroup]
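Note on the "Setting up auth" step above: rather than embedding the token in the remote URL, actions/checkout injects a basic-auth header through git config, so the credential rides along on every HTTPS request. A minimal sketch of the same technique in bash ($TOKEN here is a stand-in for a real GitHub token, not something this log provides):

  # Sketch: header-based git auth over HTTPS, as in the 'Setting up auth' step above.
  # actions/checkout base64-encodes "x-access-token:<token>" as the basic-auth value.
  B64=$(printf 'x-access-token:%s' "$TOKEN" | base64 -w0)
  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64}"
  git fetch origin   # now authenticated via the extra header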
2025-05-07T20:22:44.8844466Z ##[group]Fetching the repository
2025-05-07T20:22:44.8853443Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3842339Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3842873Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3867122Z ##[endgroup]
2025-05-07T20:22:45.3867532Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3870642Z ##[endgroup]
2025-05-07T20:22:45.3886919Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3926604Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3961358Z ##[group]Checking out the ref
2025-05-07T20:22:45.3966054Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5072502Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5072969Z
2025-05-07T20:22:45.5073196Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5073705Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5074198Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5074587Z
2025-05-07T20:22:45.5074812Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5075308Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5075584Z
2025-05-07T20:22:45.5075700Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5075887Z
2025-05-07T20:22:45.5076051Z Or undo this operation with:
2025-05-07T20:22:45.5076223Z
2025-05-07T20:22:45.5076314Z   git switch -
2025-05-07T20:22:45.5076818Z
2025-05-07T20:22:45.5077042Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5077359Z
2025-05-07T20:22:45.5077743Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5085086Z ##[endgroup]
2025-05-07T20:22:45.5085503Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5091073Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5136751Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5167895Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5202282Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5229424Z ##[endgroup]
2025-05-07T20:22:45.5229805Z ##[group]Fetching submodules
2025-05-07T20:22:45.5232608Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5573813Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5905526Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5907195Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5910655Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5914302Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5917967Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5922259Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5925728Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5958264Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9178548Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3948348Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8152721Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.8677926Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.1275171Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4365719Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7584821Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7585323Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.8083017Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4961178Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4961647Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.7810605Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.6742306Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.6742917Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.7822925Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.9785300Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.9785817Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.6780680Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.3658096Z From https://github.com/google/googletest
2025-05-07T20:22:54.3658705Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.4067920Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.9715367Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.9715963Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.9801100Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.6483664Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.6484265Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.7616766Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
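Each submodule above is pinned to an exact commit and fetched at depth 1, which is why the log shows "* branch <sha> -> FETCH_HEAD" rather than a full history. A sketch of the underlying idiom ($SHA is a hypothetical commit id; fetching by raw SHA works here because github.com permits it):

  # Sketch: shallow-checkout a repository at an exact commit, as
  # 'git submodule update --init --force --depth=1' does per submodule.
  git fetch --depth=1 origin "$SHA"
  git checkout --force "$SHA"   # leaves a detached HEAD at the pinned commit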
2025-05-07T20:22:55.7636477Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.7977373Z Entering 'external/asmjit'
2025-05-07T20:22:55.8009687Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8041887Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8074033Z Entering 'external/cutlass'
2025-05-07T20:22:55.8105906Z Entering 'external/googletest'
2025-05-07T20:22:55.8137662Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8169898Z Entering 'external/json'
2025-05-07T20:22:55.8215782Z ##[endgroup]
2025-05-07T20:22:55.8216218Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.8223518Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.8551715Z Entering 'external/asmjit'
2025-05-07T20:22:55.8618113Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8692011Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8757925Z Entering 'external/cutlass'
2025-05-07T20:22:55.8831260Z Entering 'external/googletest'
2025-05-07T20:22:55.8897434Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8966435Z Entering 'external/json'
2025-05-07T20:22:55.9051954Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.9380812Z Entering 'external/asmjit'
2025-05-07T20:22:55.9443655Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.9445600Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9506759Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.9510194Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9574659Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.9577695Z Entering 'external/cutlass'
2025-05-07T20:22:55.9639823Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.9643051Z Entering 'external/googletest'
2025-05-07T20:22:55.9703528Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.9706450Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9767491Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.9770437Z Entering 'external/json'
2025-05-07T20:22:55.9830264Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.9914589Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.0245964Z Entering 'external/asmjit'
2025-05-07T20:22:56.0279181Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0310534Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0343075Z Entering 'external/cutlass'
2025-05-07T20:22:56.0374883Z Entering 'external/googletest'
2025-05-07T20:22:56.0407679Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0438558Z Entering 'external/json'
2025-05-07T20:22:56.0491867Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.0812992Z Entering 'external/asmjit'
2025-05-07T20:22:56.0846829Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0880673Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0912662Z Entering 'external/cutlass'
2025-05-07T20:22:56.0944557Z Entering 'external/googletest'
2025-05-07T20:22:56.0975960Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1010699Z Entering 'external/json'
2025-05-07T20:22:56.1070629Z ##[endgroup]
2025-05-07T20:22:56.1091106Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.1118215Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
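The url.*.insteadOf entries written above make git transparently rewrite SSH-style submodule URLs to HTTPS, so the token header configured earlier also covers submodules whose remotes use git@github.com:. A minimal sketch of the same rewrite:

  # Sketch: route SSH-style GitHub remotes over HTTPS without editing .gitmodules.
  git config --global url.https://github.com/.insteadOf git@github.com:
  # A remote recorded as git@github.com:org/repo.git is now fetched as
  # https://github.com/org/repo.git, where the header-based auth applies.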
2025-05-07T20:22:56.1293390Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.1293705Z with:
2025-05-07T20:22:56.1293936Z   name: fbgemm_genai_x86_gcc_py3.11_cu12.6.3.whl
2025-05-07T20:22:56.1294243Z   merge-multiple: false
2025-05-07T20:22:56.1294498Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.1294754Z   run-id: 14891846252
2025-05-07T20:22:56.1294960Z env:
2025-05-07T20:22:56.1295175Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.1295470Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.1295705Z   BUILD_TARGET: genai
2025-05-07T20:22:56.1295919Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.1296156Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.1296397Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.1296630Z ##[endgroup]
2025-05-07T20:22:56.3582202Z Downloading single artifact
2025-05-07T20:22:56.4571986Z Preparing to download the following artifacts:
2025-05-07T20:22:56.4572855Z - fbgemm_genai_x86_gcc_py3.11_cu12.6.3.whl (ID: 3081362046, Size: 12503661, Expected Digest: sha256:62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9)
2025-05-07T20:22:56.5138235Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-d00c8883-fd0c-5901-9007-a9cd1395759f/artifacts/83ef9f0a55c3787ac5ec90dd5a05156a974c6e4380cbb349c58c6a5843cb1014.zip
2025-05-07T20:22:56.5139913Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.5948148Z (node:56950) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.5949077Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.7748019Z SHA256 digest of downloaded artifact is 62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9
2025-05-07T20:22:56.7748646Z Artifact download completed successfully.
2025-05-07T20:22:56.7748983Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.7753995Z Download artifact has finished successfully
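The artifact step checks the downloaded zip against the digest advertised by the service (the "Expected Digest" above). The same check can be reproduced by hand; a sketch, assuming the artifact zip was saved locally as artifact.zip:

  # Sketch: verify a downloaded artifact against its expected SHA256 digest.
  EXPECTED=62b71de05844c49a64b362ad2b6d2df4fb5f1ee6fe564783afec567436ca2ca9
  ACTUAL=$(sha256sum artifact.zip | cut -d' ' -f1)
  [ "$ACTUAL" = "$EXPECTED" ] || { echo "digest mismatch: $ACTUAL" >&2; exit 1; }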
2025-05-07T20:22:56.8021170Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.8021560Z with:
2025-05-07T20:22:56.8021771Z   driver-version: 570.133.07
2025-05-07T20:22:56.8022018Z env:
2025-05-07T20:22:56.8022232Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8022533Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8022775Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8022996Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8023234Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8023488Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8023721Z ##[endgroup]
2025-05-07T20:22:56.8115166Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.8115552Z with:
2025-05-07T20:22:56.8115940Z   timeout_minutes: 10
2025-05-07T20:22:56.8116170Z   max_attempts: 3
2025-05-07T20:22:56.8139498Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:56.8162473Z   retry_wait_seconds: 10
2025-05-07T20:22:56.8162727Z   polling_interval_seconds: 1
2025-05-07T20:22:56.8162979Z   warning_on_retry: true
2025-05-07T20:22:56.8163220Z   continue_on_error: false
2025-05-07T20:22:56.8163540Z env:
2025-05-07T20:22:56.8163758Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8164056Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8164293Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8164510Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8164754Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8165005Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8165236Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.8165478Z ##[endgroup]
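The central trick in the script above: a bare nvidia-smi can exit 0 even when the driver has crashed, so GPU health is probed by querying a field that turns into ERR! in that state, with exit status 14 treated as benign. Distilled into a standalone helper (a sketch, not part of the action itself):

  # Sketch: probe GPU health by querying a concrete field rather than
  # trusting nvidia-smi's overall exit code; 14 is an allowed status,
  # per the comment in the script above.
  check_gpu_health() {
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    local status=$?   # captures the exit status of the query above
    case "$status" in
      0|14) echo "GPU healthy (status ${status})" ;;
      *)    echo "GPU unhealthy (status ${status})" >&2; return "$status" ;;
    esac
  }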
2025-05-07T20:22:56.8970790Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.8971607Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.8975518Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.5353785Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.5354492Z No packages marked for removal.
2025-05-07T20:22:57.5417767Z Dependencies resolved.
2025-05-07T20:22:57.5427556Z Nothing to do.
2025-05-07T20:22:57.5427995Z Complete!
2025-05-07T20:22:57.5748833Z + install_nvidia_driver_common
2025-05-07T20:22:57.5753010Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.5753420Z + lspci
2025-05-07T20:22:57.5755115Z Before installing NVIDIA driver
2025-05-07T20:22:57.5940995Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.5943084Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.5944472Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.5945463Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.5946310Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.5947249Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.5947950Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.5948416Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.5948815Z + lsmod
2025-05-07T20:22:57.5985687Z Module                  Size  Used by
2025-05-07T20:22:57.5986096Z xt_conntrack           16384  1
2025-05-07T20:22:57.5986468Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5986874Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5987433Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5988320Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5989376Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5990230Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5990835Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5991394Z xfrm_user              57344  1
2025-05-07T20:22:57.5991907Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5992463Z xt_addrtype            16384  2
2025-05-07T20:22:57.5992950Z nft_compat             20480  4
2025-05-07T20:22:57.5993542Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5994345Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5995075Z br_netfilter           36864  0
2025-05-07T20:22:57.5995609Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5996203Z stp                    16384  1 bridge
2025-05-07T20:22:57.5996753Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5997295Z overlay               167936  0
2025-05-07T20:22:57.5997611Z tls                   135168  0
2025-05-07T20:22:57.5997893Z nls_ascii              16384  1
2025-05-07T20:22:57.5998135Z nls_cp437              20480  1
2025-05-07T20:22:57.5998385Z vfat                   24576  1
2025-05-07T20:22:57.5998633Z fat                    86016  1 vfat
2025-05-07T20:22:57.5998890Z sunrpc                696320  1
2025-05-07T20:22:57.5999140Z ena                   180224  0
2025-05-07T20:22:57.5999379Z i8042                  45056  0
2025-05-07T20:22:57.5999632Z serio                  28672  3 i8042
2025-05-07T20:22:57.5999892Z button                 24576  0
2025-05-07T20:22:57.6000152Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.6000430Z dm_mod                188416  0
2025-05-07T20:22:57.6000675Z sch_fq_codel           20480  17
2025-05-07T20:22:57.6000936Z fuse                  163840  1
2025-05-07T20:22:57.6001188Z loop                   36864  0
2025-05-07T20:22:57.6001431Z configfs               57344  1
2025-05-07T20:22:57.6001685Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.6001959Z dmi_sysfs              20480  0
2025-05-07T20:22:57.6002202Z crc32_pclmul           16384  0
2025-05-07T20:22:57.6002455Z crc32c_intel           24576  0
2025-05-07T20:22:57.6002708Z efivarfs               24576  1
2025-05-07T20:22:57.6002952Z + modinfo nvidia
2025-05-07T20:22:57.6003671Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.6004338Z import_ns:      DMA_BUF
2025-05-07T20:22:57.6004698Z alias:          char-major-195-*
2025-05-07T20:22:57.6005052Z version:        570.133.07
2025-05-07T20:22:57.6005392Z supported:      external
2025-05-07T20:22:57.6005792Z license:        Dual MIT/GPL
2025-05-07T20:22:57.6006228Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.6006680Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.6007219Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.6007542Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.6007904Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.6008226Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.6008534Z depends:        i2c-core,drm
2025-05-07T20:22:57.6008785Z retpoline:      Y
2025-05-07T20:22:57.6009019Z name:           nvidia
2025-05-07T20:22:57.6009516Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.6010146Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.6010622Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.6011143Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.6011451Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.6011766Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.6012128Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.6012550Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.6012958Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.6013404Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.6013787Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.6014119Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.6014408Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.6014711Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.6015095Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.6015635Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.6016131Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.6016573Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.6016986Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.6017406Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.6017815Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.6018146Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.6018505Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.6018875Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.6019214Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.6019536Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.6019860Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.6020179Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.6020488Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.6020826Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.6021199Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.6021526Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.6021851Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.6022199Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.6022532Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.6022863Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.6023196Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.6023484Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.6023804Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.6024117Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.6024430Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.6024758Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.6025104Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.6025450Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.6025778Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.6026112Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.6026449Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.6026852Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.6027099Z ++ command -v nvidia-smi
2025-05-07T20:22:57.6027351Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.6027612Z + set +e
2025-05-07T20:22:57.6027919Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.4322929Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.4324036Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.4324684Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.4325295Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.4326014Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.4327232Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.4328269Z + set -e
2025-05-07T20:22:59.4328980Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.4329483Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.4329961Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.4332184Z + sudo modprobe nvidia
2025-05-07T20:22:59.5631407Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.5631853Z + lspci
2025-05-07T20:22:59.5632120Z After installing NVIDIA driver
2025-05-07T20:22:59.5750827Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.5751446Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.5752004Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.5752524Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.5753012Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.5753547Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.5754053Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.5754538Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.5754953Z + lsmod
2025-05-07T20:22:59.5782248Z Module                  Size  Used by
2025-05-07T20:22:59.5782551Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.5782951Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.5783356Z drm                   602112  1 nvidia
2025-05-07T20:22:59.5783769Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.5784120Z backlight              24576  1 drm
2025-05-07T20:22:59.5784448Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.5784859Z xt_conntrack           16384  1
2025-05-07T20:22:59.5785217Z nft_chain_nat          16384  3
2025-05-07T20:22:59.5785577Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.5785901Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.5786279Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.5786683Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.5787130Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.5787450Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.5787755Z xfrm_user              57344  1
2025-05-07T20:22:59.5788032Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.5788321Z xt_addrtype            16384  2
2025-05-07T20:22:59.5788587Z nft_compat             20480  4
2025-05-07T20:22:59.5788904Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.5789316Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.5789703Z br_netfilter           36864  0
2025-05-07T20:22:59.5789989Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.5790287Z stp                    16384  1 bridge
2025-05-07T20:22:59.5790567Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.5790861Z overlay               167936  0
2025-05-07T20:22:59.5791120Z tls                   135168  0
2025-05-07T20:22:59.5791368Z nls_ascii              16384  1
2025-05-07T20:22:59.5791956Z nls_cp437              20480  1
2025-05-07T20:22:59.5792217Z vfat                   24576  1
2025-05-07T20:22:59.5792465Z fat                    86016  1 vfat
2025-05-07T20:22:59.5792737Z sunrpc                696320  1
2025-05-07T20:22:59.5792992Z ena                   180224  0
2025-05-07T20:22:59.5793228Z i8042                  45056  0
2025-05-07T20:22:59.5793485Z serio                  28672  3 i8042
2025-05-07T20:22:59.5793762Z button                 24576  0
2025-05-07T20:22:59.5794015Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.5794274Z dm_mod                188416  0
2025-05-07T20:22:59.5794532Z sch_fq_codel           20480  17
2025-05-07T20:22:59.5794795Z fuse                  163840  1
2025-05-07T20:22:59.5795039Z loop                   36864  0
2025-05-07T20:22:59.5795450Z configfs               57344  1
2025-05-07T20:22:59.5795707Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.5795976Z dmi_sysfs              20480  0
2025-05-07T20:22:59.5796231Z crc32_pclmul           16384  0
2025-05-07T20:22:59.5796499Z crc32c_intel           24576  0
2025-05-07T20:22:59.5796751Z efivarfs               24576  1
2025-05-07T20:22:59.5797003Z + modinfo nvidia
2025-05-07T20:22:59.5799146Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.5799783Z import_ns:      DMA_BUF
2025-05-07T20:22:59.5800108Z alias:          char-major-195-*
2025-05-07T20:22:59.5800395Z version:        570.133.07
2025-05-07T20:22:59.5800643Z supported:      external
2025-05-07T20:22:59.5800885Z license:        Dual MIT/GPL
2025-05-07T20:22:59.5801170Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.5801512Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.5801824Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.5802149Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.5802497Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.5802829Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.5803138Z depends:        i2c-core,drm
2025-05-07T20:22:59.5803393Z retpoline:      Y
2025-05-07T20:22:59.5803765Z name:           nvidia
2025-05-07T20:22:59.5804206Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.5804844Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.5805445Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.5805860Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.5806169Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.5806470Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.5806776Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.5807078Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.5807388Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.5807748Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.5808131Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.5808465Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.5808764Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.5809059Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.5809418Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.5809809Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.5810176Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.5810588Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5810992Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.5811408Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5811808Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.5812148Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.5812518Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.5813012Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.5813356Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.5813674Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.5813996Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.5814316Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.5814624Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.5814967Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.5815322Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.5815650Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.5815984Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.5816319Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.5816747Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.5817085Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.5817408Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.5817702Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.5818027Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.5818344Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.5818656Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.5818982Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.5819338Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.5819738Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.5820056Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.5820404Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.5820742Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.5821028Z + set +e
2025-05-07T20:22:59.5821214Z + nvidia-smi
2025-05-07T20:23:00.9905326Z Wed May  7 20:23:00 2025
2025-05-07T20:23:00.9906026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9906972Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.9907852Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9908717Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9909260Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9909692Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9910029Z |=========================================+========================+======================|
2025-05-07T20:23:00.9972264Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9972724Z |  0%   30C    P0             59W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9973108Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9973503Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9973898Z
2025-05-07T20:23:00.9974478Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9974938Z | Processes:                                                                              |
2025-05-07T20:23:00.9975383Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.9975796Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9976138Z |=========================================================================================|
2025-05-07T20:23:00.9977086Z |  No running processes found                                                             |
2025-05-07T20:23:00.9977990Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.4111395Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.8220183Z NVIDIA A10G
2025-05-07T20:23:03.0894089Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.0894760Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.0895070Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.0895370Z + set -e
2025-05-07T20:23:03.0895578Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.0903398Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.0906872Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.5437567Z Last metadata expiration check: 0:05:44 ago on Wed May  7 20:17:19 2025.
2025-05-07T20:23:03.5690792Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.6084637Z Dependencies resolved.
2025-05-07T20:23:03.6266884Z Nothing to do.
2025-05-07T20:23:03.6267204Z Complete!
2025-05-07T20:23:03.6657709Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.6658552Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6659734Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9860581Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.0428828Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.5174761Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.5426696Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5827270Z Dependencies resolved.
2025-05-07T20:23:04.6005771Z ================================================================================
2025-05-07T20:23:04.6006193Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:04.6006600Z ================================================================================
2025-05-07T20:23:04.6006897Z Downgrading:
2025-05-07T20:23:04.6007262Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.6007848Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.6008197Z
2025-05-07T20:23:04.6008290Z Transaction Summary
2025-05-07T20:23:04.6008535Z ================================================================================
2025-05-07T20:23:04.6008849Z Downgrade  2 Packages
2025-05-07T20:23:04.6008997Z
2025-05-07T20:23:04.6009107Z Total download size: 6.8 M
2025-05-07T20:23:04.6010013Z Downloading Packages:
2025-05-07T20:23:04.6651691Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  20 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6836632Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  69 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6845888Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6848762Z Total                                            82 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6851492Z Running transaction check
2025-05-07T20:23:04.6954484Z Transaction check succeeded.
2025-05-07T20:23:04.6955139Z Running transaction test
2025-05-07T20:23:04.7251537Z Transaction test succeeded.
2025-05-07T20:23:04.7254109Z Running transaction
2025-05-07T20:23:05.2719285Z   Preparing        :                                                        1/1
2025-05-07T20:23:05.3775721Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:05.3812409Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.4038705Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.4039299Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.4142013Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.4167685Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.8024860Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:06.8025469Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:06.8026003Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:06.8026533Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:06.9471506Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.9471506Z ================================================================================
2025-05-07T20:23:06.9472382Z WARNING:
2025-05-07T20:23:06.9472633Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.9472866Z
2025-05-07T20:23:06.9472964Z   Available Versions:
2025-05-07T20:23:06.9473113Z
2025-05-07T20:23:06.9473216Z   Version 2023.7.20250331:
2025-05-07T20:23:06.9473528Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.9473786Z
2025-05-07T20:23:06.9473906Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.9474114Z
2025-05-07T20:23:06.9474206Z     Release notes:
2025-05-07T20:23:06.9474607Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.9474981Z
2025-05-07T20:23:06.9475071Z   Version 2023.7.20250414:
2025-05-07T20:23:06.9475377Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.9475622Z
2025-05-07T20:23:06.9475743Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.9475949Z
2025-05-07T20:23:06.9476038Z     Release notes:
2025-05-07T20:23:06.9476441Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.9476803Z
2025-05-07T20:23:06.9476899Z   Version 2023.7.20250428:
2025-05-07T20:23:06.9477302Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.9477577Z
2025-05-07T20:23:06.9477950Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.9478218Z
2025-05-07T20:23:06.9478371Z     Release notes:
2025-05-07T20:23:06.9478819Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9479289Z
2025-05-07T20:23:06.9479472Z ================================================================================
2025-05-07T20:23:06.9828147Z
2025-05-07T20:23:06.9828305Z
2025-05-07T20:23:06.9842142Z Downgraded:
2025-05-07T20:23:06.9842635Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.9843221Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.9843727Z
2025-05-07T20:23:06.9843825Z Complete!
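Note that yum planned this as a downgrade on its own: the AMI ships nvidia-container-toolkit 1.17.6 while the action pins 1.16.2, and on Amazon Linux 2023 dnf resolves an install request for an explicit older version into a downgrade transaction, as seen above. A sketch of the same pinning, with an optional lock (the versionlock plugin is an assumption, not something this job installs):

  # Sketch: pin nvidia-container-toolkit to an exact version; dnf resolves
  # the request as a downgrade when a newer build is already installed.
  sudo dnf install -y nvidia-container-toolkit-1.16.2
  # Optionally freeze it so a later 'dnf upgrade' does not undo the pin
  # (requires the dnf versionlock plugin):
  # sudo dnf versionlock add nvidia-container-toolkit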
2025-05-07T20:23:07.0307921Z + sudo systemctl restart docker
2025-05-07T20:23:10.9842219Z Wed May  7 20:23:10 2025
2025-05-07T20:23:10.9843002Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.9844196Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:10.9845155Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.9846133Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.9847168Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.9848017Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.9848674Z |=========================================+========================+======================|
2025-05-07T20:23:10.9923719Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.9924662Z |  0%   30C    P0             59W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.9925076Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.9925473Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.9925864Z
2025-05-07T20:23:10.9926245Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.9926671Z | Processes:                                                                              |
2025-05-07T20:23:10.9927114Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:10.9927690Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.9928039Z |=========================================================================================|
2025-05-07T20:23:10.9928491Z |  No running processes found                                                             |
2025-05-07T20:23:10.9928960Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.8749733Z Command completed after 1 attempt(s).
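With the toolkit installed and persistence mode now showing On, the step's final export (GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all) is what later steps splice into docker run. A sketch of how such a step would consume it (the image tag is a placeholder):

  # Sketch: consume the exported GPU_FLAG in a later job step.
  # ${GPU_FLAG} is deliberately unquoted so it splits into separate args.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi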
2025-05-07T20:23:11.8835922Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8836410Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8852435Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.8852787Z env:
2025-05-07T20:23:11.8853017Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.8853316Z   BUILD_ENV: build_binary
2025-05-07T20:23:11.8853566Z   BUILD_TARGET: genai
2025-05-07T20:23:11.8853808Z   BUILD_VARIANT: cuda
2025-05-07T20:23:11.8854040Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.8854297Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.8854600Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.8854922Z ##[endgroup]
2025-05-07T20:23:12.2233432Z ################################################################################
2025-05-07T20:23:12.2233795Z # Print System Info
2025-05-07T20:23:12.2234013Z #
2025-05-07T20:23:12.2249126Z # [2025-05-07T20:23:12.224Z] + print_system_info
2025-05-07T20:23:12.2249491Z ################################################################################
2025-05-07T20:23:12.2249712Z
2025-05-07T20:23:12.2249827Z ################################################################################
2025-05-07T20:23:12.2250163Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.2250464Z + printenv
2025-05-07T20:23:12.2250581Z
2025-05-07T20:23:12.2275002Z SHELL=/bin/bash
2025-05-07T20:23:12.2275403Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.2275971Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.2276690Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2277471Z GITHUB_ACTION=__run
2025-05-07T20:23:12.2277874Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2278338Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.2278662Z RUNNER_NAME=i-03e120d7c73b3b069
2025-05-07T20:23:12.2278956Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.2279263Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.2279521Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.2279892Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.2280322Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.2280601Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.2280920Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.2281426Z ***
2025-05-07T20:23:12.2281632Z LOGNAME=ec2-user
2025-05-07T20:23:12.2281864Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.2282127Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.2282371Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.2282602Z SYSTEMD_EXEC_PID=55511
2025-05-07T20:23:12.2282884Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.2283538Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.2284054Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.2284341Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.2284600Z RUNNER_OS=Linux
2025-05-07T20:23:12.2284827Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.2285072Z HOME=/home/ec2-user
2025-05-07T20:23:12.2285328Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.2285626Z LANG=C.UTF-8
2025-05-07T20:23:12.2285915Z RUNNER_TRACKING_ID=github_04a57729-97cf-41ac-88c5-5ac90b307b9a
2025-05-07T20:23:12.2286269Z RUNNER_ARCH=X64
2025-05-07T20:23:12.2286554Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.2287253Z BUILD_TARGET=genai
2025-05-07T20:23:12.2287781Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2288642Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2289374Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.2290034Z INVOCATION_ID=4dac1ab9286f4f74ada387b6af3aba5a
2025-05-07T20:23:12.2290367Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.2290635Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.2291203Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_3a5eed80-7251-498b-a987-a21c05c070ae
2025-05-07T20:23:12.2291814Z BUILD_ENV=build_binary
2025-05-07T20:23:12.2292045Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.2292281Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.2292536Z KERN_NAME_LC=linux
2025-05-07T20:23:12.2292768Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.2293072Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.2293401Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.2293671Z USER=ec2-user
2025-05-07T20:23:12.2293992Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.2294371Z SHLVL=1 2025-05-07T20:23:12.2294640Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.2295060Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.2295542Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.2295902Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.2296144Z KERN_NAME=Linux 2025-05-07T20:23:12.2296371Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.2296829Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.2297403Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.2297679Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.2297921Z JOURNAL_STREAM=8:93485 2025-05-07T20:23:12.2298244Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.2298605Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.2298912Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.2299248Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.2299470Z CI=true 2025-05-07T20:23:12.2299673Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.2299959Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.2300239Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.2300481Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.2301089Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_3a5eed80-7251-498b-a987-a21c05c070ae 2025-05-07T20:23:12.2301678Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.2301892Z _=/usr/bin/printenv 2025-05-07T20:23:12.2302034Z 2025-05-07T20:23:12.2302153Z ################################################################################ 2025-05-07T20:23:12.2302477Z [INFO] Print ldd version ... 2025-05-07T20:23:12.2302742Z + ldd --version 2025-05-07T20:23:12.2302876Z 2025-05-07T20:23:12.2302972Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.2303246Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.2303693Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.2304221Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.2304678Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.2304903Z 2025-05-07T20:23:12.2305018Z ################################################################################ 2025-05-07T20:23:12.2305333Z [INFO] Print CPU info ... 
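The `GITHUB_ENV`, `GITHUB_OUTPUT`, and `GITHUB_PATH` entries in the environment dump above are GitHub Actions "file commands": appending lines to these files is how a step passes environment variables, step outputs, and PATH entries to later steps. A generic usage sketch; the variable and output names here are illustrative, not taken from this workflow:

```bash
# Standard GitHub Actions file-command usage; names are hypothetical.
echo "BUILD_REF=${GITHUB_SHA}" >> "$GITHUB_ENV"          # env var for later steps
echo "artifact_name=fbgemm_gpu.whl" >> "$GITHUB_OUTPUT"  # step output
echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"             # prepended to PATH
```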
2025-05-07T20:23:12.2305571Z + nproc 2025-05-07T20:23:12.2305688Z 2025-05-07T20:23:12.2318476Z 16 2025-05-07T20:23:12.2320095Z 2025-05-07T20:23:12.2320410Z + lscpu 2025-05-07T20:23:12.2320542Z 2025-05-07T20:23:12.2432671Z Architecture: x86_64 2025-05-07T20:23:12.2433186Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.2433929Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2434323Z Byte Order: Little Endian 2025-05-07T20:23:12.2434638Z CPU(s): 16 2025-05-07T20:23:12.2434923Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.2435238Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.2435578Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.2435887Z CPU family: 23 2025-05-07T20:23:12.2436322Z Model: 49 2025-05-07T20:23:12.2436613Z Thread(s) per core: 2 2025-05-07T20:23:12.2436894Z Core(s) per socket: 8 2025-05-07T20:23:12.2437177Z Socket(s): 1 2025-05-07T20:23:12.2437452Z Stepping: 0 2025-05-07T20:23:12.2437749Z BogoMIPS: 5599.99 2025-05-07T20:23:12.2440094Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2442198Z Hypervisor vendor: KVM 2025-05-07T20:23:12.2442506Z Virtualization type: full 2025-05-07T20:23:12.2442845Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.2443215Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.2443693Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.2444044Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.2444370Z NUMA node(s): 1 2025-05-07T20:23:12.2444660Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.2444994Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.2445366Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.2445722Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.2446065Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.2446428Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.2446787Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.2447149Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.2447691Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.2448447Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.2449210Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.2450168Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.2451191Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.2451869Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.2452233Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.2452550Z 2025-05-07T20:23:12.2452640Z + cat /proc/cpuinfo 2025-05-07T20:23:12.2452781Z 2025-05-07T20:23:12.2452865Z processor : 0 2025-05-07T20:23:12.2453082Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2453316Z cpu family : 23 2025-05-07T20:23:12.2453530Z model : 49 
2025-05-07T20:23:12.2453738Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2453974Z stepping : 0 2025-05-07T20:23:12.2454183Z microcode : 0x830107f 2025-05-07T20:23:12.2454574Z cpu MHz : 2359.481 2025-05-07T20:23:12.2454784Z cache size : 512 KB 2025-05-07T20:23:12.2454999Z physical id : 0 2025-05-07T20:23:12.2455213Z siblings : 16 2025-05-07T20:23:12.2455408Z core id : 0 2025-05-07T20:23:12.2455609Z cpu cores : 8 2025-05-07T20:23:12.2455808Z apicid : 0 2025-05-07T20:23:12.2456003Z initial apicid : 0 2025-05-07T20:23:12.2456218Z fpu : yes 2025-05-07T20:23:12.2456420Z fpu_exception : yes 2025-05-07T20:23:12.2456630Z cpuid level : 13 2025-05-07T20:23:12.2456840Z wp : yes 2025-05-07T20:23:12.2458939Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2461188Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2461675Z bogomips : 5599.99 2025-05-07T20:23:12.2461893Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2462133Z clflush size : 64 2025-05-07T20:23:12.2462358Z cache_alignment : 64 2025-05-07T20:23:12.2462622Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2462943Z power management: 2025-05-07T20:23:12.2463075Z 2025-05-07T20:23:12.2463166Z processor : 1 2025-05-07T20:23:12.2463375Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2463619Z cpu family : 23 2025-05-07T20:23:12.2463829Z model : 49 2025-05-07T20:23:12.2464030Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2464278Z stepping : 0 2025-05-07T20:23:12.2464485Z microcode : 0x830107f 2025-05-07T20:23:12.2464710Z cpu MHz : 3286.602 2025-05-07T20:23:12.2464921Z cache size : 512 KB 2025-05-07T20:23:12.2465144Z physical id : 0 2025-05-07T20:23:12.2465355Z siblings : 16 2025-05-07T20:23:12.2465553Z core id : 1 2025-05-07T20:23:12.2465757Z cpu cores : 8 2025-05-07T20:23:12.2465960Z apicid : 2 2025-05-07T20:23:12.2466157Z initial apicid : 2 2025-05-07T20:23:12.2466370Z fpu : yes 2025-05-07T20:23:12.2466573Z fpu_exception : yes 2025-05-07T20:23:12.2466786Z cpuid level : 13 2025-05-07T20:23:12.2466996Z wp : yes 2025-05-07T20:23:12.2468951Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2471173Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2471659Z bogomips : 5599.99 2025-05-07T20:23:12.2471880Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2472118Z clflush size : 64 
2025-05-07T20:23:12.2472331Z cache_alignment : 64 2025-05-07T20:23:12.2472601Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2472919Z power management: 2025-05-07T20:23:12.2473051Z 2025-05-07T20:23:12.2473144Z processor : 2 2025-05-07T20:23:12.2473353Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2473591Z cpu family : 23 2025-05-07T20:23:12.2473796Z model : 49 2025-05-07T20:23:12.2474000Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2474243Z stepping : 0 2025-05-07T20:23:12.2474455Z microcode : 0x830107f 2025-05-07T20:23:12.2474673Z cpu MHz : 2775.142 2025-05-07T20:23:12.2474894Z cache size : 512 KB 2025-05-07T20:23:12.2475110Z physical id : 0 2025-05-07T20:23:12.2475317Z siblings : 16 2025-05-07T20:23:12.2475645Z core id : 2 2025-05-07T20:23:12.2475847Z cpu cores : 8 2025-05-07T20:23:12.2476048Z apicid : 4 2025-05-07T20:23:12.2476242Z initial apicid : 4 2025-05-07T20:23:12.2476459Z fpu : yes 2025-05-07T20:23:12.2476662Z fpu_exception : yes 2025-05-07T20:23:12.2476874Z cpuid level : 13 2025-05-07T20:23:12.2477084Z wp : yes 2025-05-07T20:23:12.2479116Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2481338Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2481832Z bogomips : 5599.99 2025-05-07T20:23:12.2482050Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2482288Z clflush size : 64 2025-05-07T20:23:12.2482514Z cache_alignment : 64 2025-05-07T20:23:12.2482779Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2483093Z power management: 2025-05-07T20:23:12.2483224Z 2025-05-07T20:23:12.2483314Z processor : 3 2025-05-07T20:23:12.2483679Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2483928Z cpu family : 23 2025-05-07T20:23:12.2484134Z model : 49 2025-05-07T20:23:12.2484335Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2484573Z stepping : 0 2025-05-07T20:23:12.2484779Z microcode : 0x830107f 2025-05-07T20:23:12.2485019Z cpu MHz : 3298.846 2025-05-07T20:23:12.2485225Z cache size : 512 KB 2025-05-07T20:23:12.2485441Z physical id : 0 2025-05-07T20:23:12.2485649Z siblings : 16 2025-05-07T20:23:12.2485844Z core id : 3 2025-05-07T20:23:12.2486045Z cpu cores : 8 2025-05-07T20:23:12.2486257Z apicid : 6 2025-05-07T20:23:12.2486450Z initial apicid : 6 2025-05-07T20:23:12.2486663Z fpu : yes 2025-05-07T20:23:12.2486867Z fpu_exception : yes 2025-05-07T20:23:12.2487078Z cpuid level : 13 2025-05-07T20:23:12.2487290Z wp : yes 2025-05-07T20:23:12.2489249Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2491471Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2491960Z bogomips : 5599.99 2025-05-07T20:23:12.2492198Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2492461Z clflush size : 64 2025-05-07T20:23:12.2492680Z cache_alignment : 64 2025-05-07T20:23:12.2492944Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2493256Z power management: 2025-05-07T20:23:12.2493384Z 2025-05-07T20:23:12.2493517Z processor : 4 2025-05-07T20:23:12.2507975Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2508338Z cpu family : 23 2025-05-07T20:23:12.2508642Z model : 49 2025-05-07T20:23:12.2508876Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2509121Z stepping : 0 2025-05-07T20:23:12.2509336Z microcode : 0x830107f 2025-05-07T20:23:12.2509573Z cpu MHz : 3302.521 2025-05-07T20:23:12.2509787Z cache size : 512 KB 2025-05-07T20:23:12.2510006Z physical id : 0 2025-05-07T20:23:12.2510217Z siblings : 16 2025-05-07T20:23:12.2510412Z core id : 4 2025-05-07T20:23:12.2510615Z cpu cores : 8 2025-05-07T20:23:12.2510823Z apicid : 8 2025-05-07T20:23:12.2511147Z initial apicid : 8 2025-05-07T20:23:12.2511366Z fpu : yes 2025-05-07T20:23:12.2511630Z fpu_exception : yes 2025-05-07T20:23:12.2511846Z cpuid level : 13 2025-05-07T20:23:12.2512060Z wp : yes 2025-05-07T20:23:12.2514111Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2516346Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2516837Z bogomips : 5599.99 2025-05-07T20:23:12.2517055Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2517299Z clflush size : 64 2025-05-07T20:23:12.2517523Z cache_alignment : 64 2025-05-07T20:23:12.2517791Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2518113Z power management: 2025-05-07T20:23:12.2518247Z 2025-05-07T20:23:12.2518344Z processor : 5 2025-05-07T20:23:12.2518558Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2518798Z cpu family : 23 2025-05-07T20:23:12.2519014Z model : 49 2025-05-07T20:23:12.2519212Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2519469Z stepping : 0 2025-05-07T20:23:12.2519685Z microcode : 0x830107f 2025-05-07T20:23:12.2519911Z cpu MHz : 3295.810 2025-05-07T20:23:12.2520125Z cache size : 512 KB 2025-05-07T20:23:12.2520346Z physical id : 0 2025-05-07T20:23:12.2520552Z siblings : 16 2025-05-07T20:23:12.2520757Z core id : 5 2025-05-07T20:23:12.2520959Z cpu cores : 8 2025-05-07T20:23:12.2521157Z apicid : 10 2025-05-07T20:23:12.2521368Z initial apicid : 10 2025-05-07T20:23:12.2521585Z fpu : yes 2025-05-07T20:23:12.2521786Z fpu_exception : yes 2025-05-07T20:23:12.2522005Z cpuid level : 13 2025-05-07T20:23:12.2522215Z wp : yes 2025-05-07T20:23:12.2524317Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2526549Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2527034Z bogomips : 5599.99 2025-05-07T20:23:12.2527261Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2527508Z clflush size : 64 2025-05-07T20:23:12.2527730Z cache_alignment : 64 2025-05-07T20:23:12.2528006Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2528326Z power management: 2025-05-07T20:23:12.2528461Z 2025-05-07T20:23:12.2528547Z processor : 6 2025-05-07T20:23:12.2528774Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2529022Z cpu family : 23 2025-05-07T20:23:12.2529232Z model : 49 2025-05-07T20:23:12.2529449Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2529695Z stepping : 0 2025-05-07T20:23:12.2529899Z microcode : 0x830107f 2025-05-07T20:23:12.2530135Z cpu MHz : 3308.553 2025-05-07T20:23:12.2530354Z cache size : 512 KB 2025-05-07T20:23:12.2530572Z physical id : 0 2025-05-07T20:23:12.2530778Z siblings : 16 2025-05-07T20:23:12.2530980Z core id : 6 2025-05-07T20:23:12.2531185Z cpu cores : 8 2025-05-07T20:23:12.2531390Z apicid : 12 2025-05-07T20:23:12.2531603Z initial apicid : 12 2025-05-07T20:23:12.2531817Z fpu : yes 2025-05-07T20:23:12.2532011Z fpu_exception : yes 2025-05-07T20:23:12.2532233Z cpuid level : 13 2025-05-07T20:23:12.2532566Z wp : yes 2025-05-07T20:23:12.2534597Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2536859Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2537356Z bogomips : 5599.99 2025-05-07T20:23:12.2537581Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2537812Z clflush size : 64 2025-05-07T20:23:12.2538034Z cache_alignment : 64 2025-05-07T20:23:12.2538306Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2538842Z power management: 2025-05-07T20:23:12.2538981Z 2025-05-07T20:23:12.2539072Z processor : 7 2025-05-07T20:23:12.2539293Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2539528Z cpu family : 23 2025-05-07T20:23:12.2539730Z model : 49 2025-05-07T20:23:12.2539937Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2540182Z stepping : 0 2025-05-07T20:23:12.2540385Z microcode : 0x830107f 2025-05-07T20:23:12.2540613Z cpu MHz : 3299.922 2025-05-07T20:23:12.2540836Z cache size : 512 KB 2025-05-07T20:23:12.2541046Z physical id : 0 2025-05-07T20:23:12.2541256Z siblings : 16 2025-05-07T20:23:12.2541458Z core id : 7 2025-05-07T20:23:12.2541655Z cpu cores : 8 2025-05-07T20:23:12.2541864Z apicid : 
14 2025-05-07T20:23:12.2542074Z initial apicid : 14 2025-05-07T20:23:12.2542282Z fpu : yes 2025-05-07T20:23:12.2542486Z fpu_exception : yes 2025-05-07T20:23:12.2542702Z cpuid level : 13 2025-05-07T20:23:12.2542904Z wp : yes 2025-05-07T20:23:12.2544856Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2547081Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2547569Z bogomips : 5599.99 2025-05-07T20:23:12.2547793Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2548024Z clflush size : 64 2025-05-07T20:23:12.2548242Z cache_alignment : 64 2025-05-07T20:23:12.2548504Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2548819Z power management: 2025-05-07T20:23:12.2548946Z 2025-05-07T20:23:12.2549025Z processor : 8 2025-05-07T20:23:12.2549228Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2549454Z cpu family : 23 2025-05-07T20:23:12.2549648Z model : 49 2025-05-07T20:23:12.2549841Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2550070Z stepping : 0 2025-05-07T20:23:12.2550266Z microcode : 0x830107f 2025-05-07T20:23:12.2550479Z cpu MHz : 1994.443 2025-05-07T20:23:12.2550684Z cache size : 512 KB 2025-05-07T20:23:12.2550887Z physical id : 0 2025-05-07T20:23:12.2551088Z siblings : 16 2025-05-07T20:23:12.2551290Z core id : 0 2025-05-07T20:23:12.2551485Z cpu cores : 8 2025-05-07T20:23:12.2551673Z apicid : 1 2025-05-07T20:23:12.2551857Z initial apicid : 1 2025-05-07T20:23:12.2552068Z fpu : yes 2025-05-07T20:23:12.2552259Z fpu_exception : yes 2025-05-07T20:23:12.2552465Z cpuid level : 13 2025-05-07T20:23:12.2552660Z wp : yes 2025-05-07T20:23:12.2554593Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2557080Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2557572Z bogomips : 5599.99 2025-05-07T20:23:12.2557795Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2558025Z clflush size : 64 2025-05-07T20:23:12.2558241Z cache_alignment : 64 2025-05-07T20:23:12.2558509Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2558828Z power management: 2025-05-07T20:23:12.2558959Z 2025-05-07T20:23:12.2559043Z processor : 9 2025-05-07T20:23:12.2559263Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2559504Z cpu family : 23 2025-05-07T20:23:12.2559707Z model : 49 2025-05-07T20:23:12.2559914Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2560155Z 
stepping : 0 2025-05-07T20:23:12.2560355Z microcode : 0x830107f 2025-05-07T20:23:12.2560583Z cpu MHz : 3280.714 2025-05-07T20:23:12.2560801Z cache size : 512 KB 2025-05-07T20:23:12.2561010Z physical id : 0 2025-05-07T20:23:12.2561222Z siblings : 16 2025-05-07T20:23:12.2561416Z core id : 1 2025-05-07T20:23:12.2561617Z cpu cores : 8 2025-05-07T20:23:12.2561818Z apicid : 3 2025-05-07T20:23:12.2562016Z initial apicid : 3 2025-05-07T20:23:12.2562221Z fpu : yes 2025-05-07T20:23:12.2562422Z fpu_exception : yes 2025-05-07T20:23:12.2562645Z cpuid level : 13 2025-05-07T20:23:12.2562846Z wp : yes 2025-05-07T20:23:12.2564902Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2567120Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2567605Z bogomips : 5599.99 2025-05-07T20:23:12.2567830Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2568059Z clflush size : 64 2025-05-07T20:23:12.2568276Z cache_alignment : 64 2025-05-07T20:23:12.2568546Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2568854Z power management: 2025-05-07T20:23:12.2568991Z 2025-05-07T20:23:12.2569071Z processor : 10 2025-05-07T20:23:12.2569284Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2569518Z cpu family : 23 2025-05-07T20:23:12.2569722Z model : 49 2025-05-07T20:23:12.2569928Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2570161Z stepping : 0 2025-05-07T20:23:12.2570366Z microcode : 0x830107f 2025-05-07T20:23:12.2570590Z cpu MHz : 3260.265 2025-05-07T20:23:12.2570796Z cache size : 512 KB 2025-05-07T20:23:12.2571009Z physical id : 0 2025-05-07T20:23:12.2571210Z siblings : 16 2025-05-07T20:23:12.2571408Z core id : 2 2025-05-07T20:23:12.2571606Z cpu cores : 8 2025-05-07T20:23:12.2571803Z apicid : 5 2025-05-07T20:23:12.2571997Z initial apicid : 5 2025-05-07T20:23:12.2572221Z fpu : yes 2025-05-07T20:23:12.2572456Z fpu_exception : yes 2025-05-07T20:23:12.2572681Z cpuid level : 13 2025-05-07T20:23:12.2572881Z wp : yes 2025-05-07T20:23:12.2574983Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2577297Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2577784Z bogomips : 5599.99 2025-05-07T20:23:12.2578082Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2578321Z clflush size : 64 2025-05-07T20:23:12.2578537Z cache_alignment : 64 2025-05-07T20:23:12.2578797Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.2579118Z power management: 2025-05-07T20:23:12.2579247Z 2025-05-07T20:23:12.2579337Z processor : 11 2025-05-07T20:23:12.2579548Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2579779Z cpu family : 23 2025-05-07T20:23:12.2579982Z model : 49 2025-05-07T20:23:12.2580189Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2580422Z stepping : 0 2025-05-07T20:23:12.2580627Z microcode : 0x830107f 2025-05-07T20:23:12.2580853Z cpu MHz : 3292.356 2025-05-07T20:23:12.2581057Z cache size : 512 KB 2025-05-07T20:23:12.2581268Z physical id : 0 2025-05-07T20:23:12.2581473Z siblings : 16 2025-05-07T20:23:12.2581665Z core id : 3 2025-05-07T20:23:12.2581863Z cpu cores : 8 2025-05-07T20:23:12.2582058Z apicid : 7 2025-05-07T20:23:12.2582248Z initial apicid : 7 2025-05-07T20:23:12.2582459Z fpu : yes 2025-05-07T20:23:12.2582654Z fpu_exception : yes 2025-05-07T20:23:12.2582863Z cpuid level : 13 2025-05-07T20:23:12.2583069Z wp : yes 2025-05-07T20:23:12.2585003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2587213Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2587687Z bogomips : 5599.99 2025-05-07T20:23:12.2587903Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2588136Z clflush size : 64 2025-05-07T20:23:12.2588346Z cache_alignment : 64 2025-05-07T20:23:12.2588606Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2588918Z power management: 2025-05-07T20:23:12.2589048Z 2025-05-07T20:23:12.2589137Z processor : 12 2025-05-07T20:23:12.2589341Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2589572Z cpu family : 23 2025-05-07T20:23:12.2589771Z model : 49 2025-05-07T20:23:12.2589966Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2590209Z stepping : 0 2025-05-07T20:23:12.2590410Z microcode : 0x830107f 2025-05-07T20:23:12.2590628Z cpu MHz : 3302.033 2025-05-07T20:23:12.2590838Z cache size : 512 KB 2025-05-07T20:23:12.2591047Z physical id : 0 2025-05-07T20:23:12.2591246Z siblings : 16 2025-05-07T20:23:12.2591443Z core id : 4 2025-05-07T20:23:12.2591637Z cpu cores : 8 2025-05-07T20:23:12.2591828Z apicid : 9 2025-05-07T20:23:12.2592025Z initial apicid : 9 2025-05-07T20:23:12.2592232Z fpu : yes 2025-05-07T20:23:12.2592420Z fpu_exception : yes 2025-05-07T20:23:12.2592637Z cpuid level : 13 2025-05-07T20:23:12.2592840Z wp : yes 2025-05-07T20:23:12.2594775Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.2597070Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2597547Z bogomips : 5599.99 2025-05-07T20:23:12.2597764Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2597998Z clflush size : 64 2025-05-07T20:23:12.2598208Z cache_alignment : 64 2025-05-07T20:23:12.2598558Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2598875Z power management: 2025-05-07T20:23:12.2599004Z 2025-05-07T20:23:12.2599090Z processor : 13 2025-05-07T20:23:12.2599305Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2599544Z cpu family : 23 2025-05-07T20:23:12.2599742Z model : 49 2025-05-07T20:23:12.2599944Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2600180Z stepping : 0 2025-05-07T20:23:12.2600382Z microcode : 0x830107f 2025-05-07T20:23:12.2600607Z cpu MHz : 3297.445 2025-05-07T20:23:12.2600822Z cache size : 512 KB 2025-05-07T20:23:12.2601032Z physical id : 0 2025-05-07T20:23:12.2601240Z siblings : 16 2025-05-07T20:23:12.2601445Z core id : 5 2025-05-07T20:23:12.2601634Z cpu cores : 8 2025-05-07T20:23:12.2601833Z apicid : 11 2025-05-07T20:23:12.2602037Z initial apicid : 11 2025-05-07T20:23:12.2602269Z fpu : yes 2025-05-07T20:23:12.2602481Z fpu_exception : yes 2025-05-07T20:23:12.2602692Z cpuid level : 13 2025-05-07T20:23:12.2602894Z wp : yes 2025-05-07T20:23:12.2604946Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2607154Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2607630Z bogomips : 5599.99 2025-05-07T20:23:12.2607845Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2608069Z clflush size : 64 2025-05-07T20:23:12.2608287Z cache_alignment : 64 2025-05-07T20:23:12.2608552Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2608861Z power management: 2025-05-07T20:23:12.2608997Z 2025-05-07T20:23:12.2609078Z processor : 14 2025-05-07T20:23:12.2609288Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2609522Z cpu family : 23 2025-05-07T20:23:12.2609717Z model : 49 2025-05-07T20:23:12.2609916Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2610153Z stepping : 0 2025-05-07T20:23:12.2610347Z microcode : 0x830107f 2025-05-07T20:23:12.2610568Z cpu MHz : 3300.912 2025-05-07T20:23:12.2610777Z cache size : 512 KB 2025-05-07T20:23:12.2610985Z physical id : 0 2025-05-07T20:23:12.2611191Z siblings : 16 2025-05-07T20:23:12.2611388Z core id : 6 2025-05-07T20:23:12.2611577Z cpu cores : 8 2025-05-07T20:23:12.2611771Z apicid : 13 2025-05-07T20:23:12.2611971Z initial apicid : 13 2025-05-07T20:23:12.2612175Z fpu : yes 2025-05-07T20:23:12.2612370Z fpu_exception : yes 2025-05-07T20:23:12.2612583Z cpuid level : 13 2025-05-07T20:23:12.2612779Z wp : yes 2025-05-07T20:23:12.2614715Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2618952Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2619435Z bogomips : 5599.99 2025-05-07T20:23:12.2619645Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2619880Z clflush size : 64 2025-05-07T20:23:12.2620094Z cache_alignment : 64 2025-05-07T20:23:12.2620364Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2620672Z power management: 2025-05-07T20:23:12.2620806Z 2025-05-07T20:23:12.2620977Z processor : 15 2025-05-07T20:23:12.2621198Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2621429Z cpu family : 23 2025-05-07T20:23:12.2621637Z model : 49 2025-05-07T20:23:12.2621842Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2622077Z stepping : 0 2025-05-07T20:23:12.2622285Z microcode : 0x830107f 2025-05-07T20:23:12.2622512Z cpu MHz : 3294.458 2025-05-07T20:23:12.2622718Z cache size : 512 KB 2025-05-07T20:23:12.2622930Z physical id : 0 2025-05-07T20:23:12.2623142Z siblings : 16 2025-05-07T20:23:12.2623338Z core id : 7 2025-05-07T20:23:12.2623535Z cpu cores : 8 2025-05-07T20:23:12.2623731Z apicid : 15 2025-05-07T20:23:12.2623927Z initial apicid : 15 2025-05-07T20:23:12.2624141Z fpu : yes 2025-05-07T20:23:12.2624329Z fpu_exception : yes 2025-05-07T20:23:12.2624534Z cpuid level : 13 2025-05-07T20:23:12.2624728Z wp : yes 2025-05-07T20:23:12.2626678Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2628897Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2629387Z bogomips : 5599.99 2025-05-07T20:23:12.2629600Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2629833Z clflush size : 64 2025-05-07T20:23:12.2630044Z cache_alignment : 64 2025-05-07T20:23:12.2630308Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2630623Z power management: 2025-05-07T20:23:12.2630751Z 2025-05-07T20:23:12.2630756Z 2025-05-07T20:23:12.2630883Z ################################################################################ 2025-05-07T20:23:12.2631191Z [INFO] Print PCI info ... 2025-05-07T20:23:12.2631428Z + lspci -v 2025-05-07T20:23:12.2631547Z 2025-05-07T20:23:12.2631759Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.2632146Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.2632468Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.2632674Z 2025-05-07T20:23:12.2632872Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.2633251Z Physical Slot: 1 2025-05-07T20:23:12.2633496Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2633698Z 2025-05-07T20:23:12.2633951Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.2634380Z Physical Slot: 1 2025-05-07T20:23:12.2634638Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.2634859Z 2025-05-07T20:23:12.2635129Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.2635565Z Physical Slot: 3 2025-05-07T20:23:12.2635803Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2636141Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2636489Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.2636717Z 2025-05-07T20:23:12.2637017Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2637611Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.2637897Z Physical Slot: 4 2025-05-07T20:23:12.2638148Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.2638782Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2639156Z Capabilities: 2025-05-07T20:23:12.2639418Z Kernel driver in use: nvme 2025-05-07T20:23:12.2639585Z 2025-05-07T20:23:12.2639945Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2640429Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2640778Z Physical Slot: 5 2025-05-07T20:23:12.2641014Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2641365Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2641747Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2642068Z Capabilities: 2025-05-07T20:23:12.2642344Z Kernel driver in use: ena 2025-05-07T20:23:12.2642626Z Kernel modules: ena 2025-05-07T20:23:12.2642764Z 2025-05-07T20:23:12.2642931Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.2643312Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.2643699Z Physical Slot: 30 2025-05-07T20:23:12.2643952Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.2644323Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.2644717Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.2645086Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.2645413Z Capabilities: 2025-05-07T20:23:12.2645683Z Kernel driver in use: nvidia 2025-05-07T20:23:12.2645938Z Kernel modules: nvidia 2025-05-07T20:23:12.2646082Z 2025-05-07T20:23:12.2646382Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2646894Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.2647180Z Physical Slot: 31 2025-05-07T20:23:12.2647422Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2647773Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2648152Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.2648480Z Capabilities: 2025-05-07T20:23:12.2648740Z Kernel driver in use: nvme 2025-05-07T20:23:12.2648903Z 2025-05-07T20:23:12.2648907Z 2025-05-07T20:23:12.2649021Z ################################################################################ 2025-05-07T20:23:12.2649344Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.2649624Z + uname -a 2025-05-07T20:23:12.2649746Z 2025-05-07T20:23:12.2657397Z Linux ip-10-0-57-2.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.2657965Z 2025-05-07T20:23:12.2658057Z + uname -m 2025-05-07T20:23:12.2658191Z 2025-05-07T20:23:12.2658266Z x86_64 2025-05-07T20:23:12.2658382Z 2025-05-07T20:23:12.2658467Z + cat /proc/version 2025-05-07T20:23:12.2658597Z 2025-05-07T20:23:12.2659141Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.2659760Z 2025-05-07T20:23:12.2659847Z + cat /etc/os-release 2025-05-07T20:23:12.2659996Z 2025-05-07T20:23:12.2660081Z NAME="Amazon Linux" 2025-05-07T20:23:12.2660291Z VERSION="2023" 2025-05-07T20:23:12.2660494Z ID="amzn" 2025-05-07T20:23:12.2660680Z ID_LIKE="fedora" 2025-05-07T20:23:12.2660887Z VERSION_ID="2023" 2025-05-07T20:23:12.2661113Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.2661384Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.2661668Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.2661917Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.2662480Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.2662913Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.2663325Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.2663758Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.2664127Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.2664366Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.2664655Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.2664805Z 2025-05-07T20:23:12.2665005Z ################################################################################ 2025-05-07T20:23:12.2665308Z # Print EC2 Instance Info 2025-05-07T20:23:12.2665541Z # 2025-05-07T20:23:12.2665751Z # [2025-05-07T20:23:12.264Z] + print_ec2_info 2025-05-07T20:23:12.2666060Z ################################################################################ 2025-05-07T20:23:12.2666272Z 2025-05-07T20:23:12.2769654Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.2892038Z instance-id: i-03e120d7c73b3b069 2025-05-07T20:23:12.3012306Z instance-type: g5.4xlarge 2025-05-07T20:23:12.3054932Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.3055292Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.3064224Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.3064590Z env: 2025-05-07T20:23:12.3064814Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.3065125Z BUILD_ENV: build_binary 2025-05-07T20:23:12.3065379Z BUILD_TARGET: genai 2025-05-07T20:23:12.3065615Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.3065871Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.3066135Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.3066450Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.3066780Z ##[endgroup] 2025-05-07T20:23:12.6431670Z ################################################################################ 2025-05-07T20:23:12.6432128Z [INFO] Printing general display info ... 2025-05-07T20:23:12.6460830Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.7607003Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.7617571Z /usr/bin/sudo 2025-05-07T20:23:12.7628258Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.7638051Z /usr/bin/yum 2025-05-07T20:23:12.7640251Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.7660732Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.2122012Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.2875822Z ================================================================================ 2025-05-07T20:23:13.2876302Z WARNING: 2025-05-07T20:23:13.2876694Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.2877024Z 2025-05-07T20:23:13.2877149Z Available Versions: 2025-05-07T20:23:13.2877360Z 2025-05-07T20:23:13.2877484Z Version 2023.7.20250331: 2025-05-07T20:23:13.2877849Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.2878116Z 2025-05-07T20:23:13.2878251Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.2878467Z 2025-05-07T20:23:13.2878564Z Release notes: 2025-05-07T20:23:13.2878969Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.2879340Z 2025-05-07T20:23:13.2879438Z Version 2023.7.20250414: 2025-05-07T20:23:13.2879742Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.2879995Z 2025-05-07T20:23:13.2880110Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.2880318Z 2025-05-07T20:23:13.2880408Z Release notes: 2025-05-07T20:23:13.2880796Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.2881167Z 2025-05-07T20:23:13.2881254Z Version 2023.7.20250428: 2025-05-07T20:23:13.2881558Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.2881805Z 2025-05-07T20:23:13.2882149Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.2882365Z 2025-05-07T20:23:13.2882451Z Release notes: 2025-05-07T20:23:13.2882842Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.2883204Z 2025-05-07T20:23:13.2883322Z ================================================================================ 2025-05-07T20:23:13.4036255Z Dependencies resolved. 
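The `[EXEC] [ATTEMPT 0/3]` prefix above suggests a retry wrapper around flaky commands such as the network probe and `yum update`. A plausible sketch of such a helper; the real implementation lives in setup_env.bash and is not shown in this log, and both the function name and the backoff policy here are assumptions:

```bash
# Hypothetical retry helper matching the "[EXEC] [ATTEMPT n/3]" log lines.
exec_with_retries () {
  local max=3 attempt
  for attempt in $(seq 0 "$max"); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
    if "$@"; then
      return 0                    # success: stop retrying
    fi
    sleep $(( 2 ** attempt ))     # exponential backoff (assumed)
  done
  echo "[EXEC] Failed after $(( max + 1 )) attempts: $*" >&2
  return 1
}

exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null
exec_with_retries sudo yum update -y
```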
2025-05-07T20:23:13.4324823Z ================================================================================ 2025-05-07T20:23:13.4325395Z Package Arch Version Repository Size 2025-05-07T20:23:13.4325921Z ================================================================================ 2025-05-07T20:23:13.4326345Z Upgrading: 2025-05-07T20:23:13.4326698Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.4327286Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.4327665Z 2025-05-07T20:23:13.4328072Z Transaction Summary 2025-05-07T20:23:13.4328377Z ================================================================================ 2025-05-07T20:23:13.4328807Z Upgrade 2 Packages 2025-05-07T20:23:13.4329001Z 2025-05-07T20:23:13.4329141Z Total download size: 6.9 M 2025-05-07T20:23:13.4329641Z Downloading Packages: 2025-05-07T20:23:13.4710433Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.5193782Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 67 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.5207654Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.5208579Z Total 79 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.5211031Z Running transaction check 2025-05-07T20:23:13.5305623Z Transaction check succeeded. 2025-05-07T20:23:13.5306149Z Running transaction test 2025-05-07T20:23:13.5601544Z Transaction test succeeded. 2025-05-07T20:23:13.5605269Z Running transaction 2025-05-07T20:23:14.1120392Z Preparing : 1/1 2025-05-07T20:23:14.2190603Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.2220910Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.2438314Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.2439453Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.2553897Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.2577340Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.4139515Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.4140148Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.4140724Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.4141255Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:14.5596054Z ================================================================================ 2025-05-07T20:23:14.5596423Z WARNING: 2025-05-07T20:23:14.5596672Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:14.5596898Z 2025-05-07T20:23:14.5596987Z Available Versions: 2025-05-07T20:23:14.5597140Z 2025-05-07T20:23:14.5597229Z Version 2023.7.20250331: 2025-05-07T20:23:14.5597537Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:14.5597784Z 2025-05-07T20:23:14.5597912Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:14.5598121Z 2025-05-07T20:23:14.5598205Z Release notes: 2025-05-07T20:23:14.5598621Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:14.5599267Z 2025-05-07T20:23:14.5599374Z Version 2023.7.20250414: 2025-05-07T20:23:14.5599679Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:14.5599926Z 2025-05-07T20:23:14.5600039Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:14.5600249Z 2025-05-07T20:23:14.5600333Z Release notes: 2025-05-07T20:23:14.5600724Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:14.5601085Z 2025-05-07T20:23:14.5601175Z Version 2023.7.20250428: 2025-05-07T20:23:14.5601481Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:14.5601731Z 2025-05-07T20:23:14.5601843Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:14.5602046Z 2025-05-07T20:23:14.5602137Z Release notes: 2025-05-07T20:23:14.5602518Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:14.5602892Z 2025-05-07T20:23:14.5603228Z ================================================================================ 2025-05-07T20:23:14.6174219Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.6174554Z 2025-05-07T20:23:14.6174647Z Upgraded: 2025-05-07T20:23:14.6174985Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.6175549Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.6175883Z 2025-05-07T20:23:14.6175975Z Complete! 2025-05-07T20:23:14.6625305Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.6651509Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.1322888Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.1568430Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.1965821Z Dependencies resolved. 
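The earlier `which: no apt-get ...` / `/usr/bin/yum` lines indicate the installer probes for a package manager before installing `hostname` and `lshw`. A hedged sketch of that fallback logic; the helper name is illustrative:

```bash
# Probe apt-get first, fall back to yum, as the "which" lines suggest.
install_system_packages () {
  if which apt-get > /dev/null 2>&1; then
    sudo apt-get update -y && sudo apt-get install -y "$@"
  elif which yum > /dev/null 2>&1; then
    sudo yum install -y "$@"
  else
    echo "No supported package manager found" >&2
    return 1
  fi
}

install_system_packages hostname lshw
```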
2025-05-07T20:23:15.2142452Z ================================================================================ 2025-05-07T20:23:15.2143008Z Package Architecture Version Repository Size 2025-05-07T20:23:15.2143431Z ================================================================================ 2025-05-07T20:23:15.2143734Z Installing: 2025-05-07T20:23:15.2144021Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.2144298Z 2025-05-07T20:23:15.2144389Z Transaction Summary 2025-05-07T20:23:15.2144675Z ================================================================================ 2025-05-07T20:23:15.2145101Z Install 1 Package 2025-05-07T20:23:15.2145304Z 2025-05-07T20:23:15.2145428Z Total download size: 319 k 2025-05-07T20:23:15.2145686Z Installed size: 837 k 2025-05-07T20:23:15.2146735Z Downloading Packages: 2025-05-07T20:23:15.2950459Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.4 MB/s | 319 kB 00:00 2025-05-07T20:23:15.2956319Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.2959119Z Total 3.9 MB/s | 319 kB 00:00 2025-05-07T20:23:15.3119611Z Running transaction check 2025-05-07T20:23:15.3174254Z Transaction check succeeded. 2025-05-07T20:23:15.3174894Z Running transaction test 2025-05-07T20:23:15.3635303Z Transaction test succeeded. 2025-05-07T20:23:15.3639327Z Running transaction 2025-05-07T20:23:15.4641196Z Preparing : 1/1 2025-05-07T20:23:15.5121157Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.6733490Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7969785Z ================================================================================ 2025-05-07T20:23:15.7970142Z WARNING: 2025-05-07T20:23:15.7970384Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:15.7970896Z 2025-05-07T20:23:15.7970996Z Available Versions: 2025-05-07T20:23:15.7971158Z 2025-05-07T20:23:15.7971246Z Version 2023.7.20250331: 2025-05-07T20:23:15.7971554Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:15.7971807Z 2025-05-07T20:23:15.7971927Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:15.7972133Z 2025-05-07T20:23:15.7972223Z Release notes: 2025-05-07T20:23:15.7972622Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:15.7972993Z 2025-05-07T20:23:15.7973080Z Version 2023.7.20250414: 2025-05-07T20:23:15.7973382Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:15.7973624Z 2025-05-07T20:23:15.7973744Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:15.7973947Z 2025-05-07T20:23:15.7974032Z Release notes: 2025-05-07T20:23:15.7974420Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:15.7974788Z 2025-05-07T20:23:15.7975058Z Version 2023.7.20250428: 2025-05-07T20:23:15.7975362Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:15.7975612Z 2025-05-07T20:23:15.7975723Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:15.7975934Z 2025-05-07T20:23:15.7976017Z Release notes: 2025-05-07T20:23:15.7976407Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:15.7976766Z 2025-05-07T20:23:15.7976883Z ================================================================================ 2025-05-07T20:23:15.8315810Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.8316301Z 2025-05-07T20:23:15.8316422Z Installed: 2025-05-07T20:23:15.8316863Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.8317286Z 2025-05-07T20:23:15.8317408Z Complete! 2025-05-07T20:23:15.8762769Z + hostname 2025-05-07T20:23:15.8762969Z 2025-05-07T20:23:15.8776606Z ip-10-0-57-2.ec2.internal 2025-05-07T20:23:15.8778180Z 2025-05-07T20:23:15.8778449Z + sudo lshw -C display 2025-05-07T20:23:15.8778607Z 2025-05-07T20:23:16.4467648Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.4468106Z description: VGA compatible controller 2025-05-07T20:23:16.4468564Z product: Amazon.com, Inc. 2025-05-07T20:23:16.4468947Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:16.4469294Z physical id: 3 2025-05-07T20:23:16.4469590Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.4469852Z version: 00 2025-05-07T20:23:16.4470063Z width: 32 bits 2025-05-07T20:23:16.4470292Z clock: 33MHz 2025-05-07T20:23:16.4470547Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.4470859Z configuration: latency=0 2025-05-07T20:23:16.4471188Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.4471528Z *-display:1 2025-05-07T20:23:16.4471757Z description: 3D controller 2025-05-07T20:23:16.4472062Z product: GA102GL [A10G] 2025-05-07T20:23:16.4472346Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.4472613Z physical id: 1e 2025-05-07T20:23:16.4472845Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.4473107Z version: a1 2025-05-07T20:23:16.4473324Z width: 64 bits 2025-05-07T20:23:16.4473553Z clock: 33MHz 2025-05-07T20:23:16.4473885Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.4474265Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.4474882Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.4508293Z 2025-05-07T20:23:16.4508733Z ################################################################################ 2025-05-07T20:23:16.4509200Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.4637824Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.4808370Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.4808899Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4809593Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.4810138Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.4810634Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.4811162Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.4811586Z | | | MIG M. | 2025-05-07T20:23:16.4811923Z |=========================================+========================+======================| 2025-05-07T20:23:16.4887336Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.4888228Z | 0% 31C P0 57W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.4888771Z | | | N/A | 2025-05-07T20:23:16.4889345Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.4889900Z 2025-05-07T20:23:16.4890321Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4890741Z | Processes: | 2025-05-07T20:23:16.4891181Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.4891594Z | ID ID Usage | 2025-05-07T20:23:16.4891999Z |=========================================================================================| 2025-05-07T20:23:16.4892602Z | No running processes found | 2025-05-07T20:23:16.4893255Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.6347745Z ################################################################################ 2025-05-07T20:23:16.6348217Z [INFO] Printing AMD GPU info ... 
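Taken together, the NVIDIA section above and the ROCm probe that follows imply a per-vendor check inside `print_gpu_info`: report the GPU through `lspci` and `nvidia-smi` when the NVIDIA stack is present, and separately probe for the ROCm tools (absent on this CUDA runner). A sketch under that assumption, with a hypothetical function name:

```bash
# Vendor probes inferred from the NVIDIA/AMD sections of this log.
print_gpu_info_sketch () {
  if which nvidia-smi > /dev/null 2>&1; then
    lspci | grep -i nvidia      # PCI view of the device
    nvidia-smi                  # driver / CUDA version, utilization
  fi
  if which rocminfo > /dev/null 2>&1; then
    rocminfo
  else
    echo "[CHECK] rocminfo not found"
  fi
}
```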
2025-05-07T20:23:16.6488217Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6488997Z [CHECK] rocminfo not found 2025-05-07T20:23:16.6498177Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6499176Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.6563583Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6564022Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6576450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.6576810Z env: 2025-05-07T20:23:16.6577031Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.6577322Z BUILD_ENV: build_binary 2025-05-07T20:23:16.6577566Z BUILD_TARGET: genai 2025-05-07T20:23:16.6577790Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.6578016Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.6578273Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.6578571Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.6578897Z ##[endgroup] 2025-05-07T20:23:16.9946935Z ################################################################################ 2025-05-07T20:23:16.9947298Z # Setup Miniconda 2025-05-07T20:23:16.9947527Z # 2025-05-07T20:23:16.9961386Z # [2025-05-07T20:23:16.995Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:16.9961799Z ################################################################################ 2025-05-07T20:23:16.9962013Z 2025-05-07T20:23:16.9975983Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.0896867Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.0897226Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.0897422Z 2025-05-07T20:23:17.0914462Z 2025-05-07T20:23:17.0914792Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.0938119Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.1029006Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.1029387Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.1029640Z 2025-05-07T20:23:18.1178403Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.5671651Z Unpacking payload ... 2025-05-07T20:23:19.0885062Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:19.9248269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.0254993Z 2025-05-07T20:23:22.0255652Z Installing base environment... 2025-05-07T20:23:22.0255917Z 2025-05-07T20:23:23.1070675Z Preparing transaction: ...working... done 2025-05-07T20:23:26.1181783Z Executing transaction: ...working... done 2025-05-07T20:23:26.7808005Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.8695862Z installation finished. 2025-05-07T20:23:26.8702575Z 2025-05-07T20:23:26.8703033Z + rm -f miniconda.sh 2025-05-07T20:23:26.8703297Z 2025-05-07T20:23:26.9016083Z 2025-05-07T20:23:26.9016492Z [SETUP] Reloading the bash configuration ... 
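[NOTE] The Miniconda setup above is a standard non-interactive (batch) install; a minimal sketch of the same flow, assuming the latest Linux x86_64 installer and a prefix of $HOME/miniconda (both as shown in this log). The conda init and bashrc reload correspond to the output that follows:

  prefix="$HOME/miniconda"
  mkdir -p "$prefix"
  # Fetch the installer quietly; it is roughly 100 MB
  wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  # -b: batch mode (no prompts), -p: install prefix, -u: update an existing install in place
  bash miniconda.sh -b -p "$prefix" -u
  rm -f miniconda.sh
  # Register conda in ~/.bashrc, then reload so the current shell picks it up
  "$prefix/bin/conda" init bash
  . ~/.bashrc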
2025-05-07T20:23:26.9016952Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.2682949Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.2684239Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.2685203Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.2686189Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.2687039Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.2687426Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.2687861Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.2688302Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.2688754Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.2689589Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.2690121Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.2690495Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.2690881Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3364525Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.1791542Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.1816366Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.5555393Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.1649092Z Solving environment: done
2025-05-07T20:23:43.2616396Z ## Package Plan ##
2025-05-07T20:23:43.2616722Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.2617062Z added / updated specs:
2025-05-07T20:23:43.2617325Z - conda-libmamba-solver
2025-05-07T20:23:43.2617583Z - libarchive
2025-05-07T20:23:43.2617794Z - libmamba
2025-05-07T20:23:43.2617993Z - libmambapy
2025-05-07T20:23:43.2618257Z The following packages will be downloaded:
2025-05-07T20:23:43.2618595Z package | build
2025-05-07T20:23:43.2618907Z ---------------------------|-----------------
2025-05-07T20:23:43.2619322Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.2619795Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.2620228Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.2620696Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.2621146Z ------------------------------------------------------------
2025-05-07T20:23:43.2621489Z Total: 1.4 MB
2025-05-07T20:23:43.2621811Z The following packages will be UPDATED:
2025-05-07T20:23:43.2625673Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.2626456Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.2627067Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.2627714Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.2628509Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.2629149Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:43.5952119Z Preparing transaction: done
2025-05-07T20:23:43.6958115Z Verifying transaction: done
2025-05-07T20:23:45.0978401Z Executing transaction: done
2025-05-07T20:23:46.9498418Z [SETUP] Updating Miniconda base packages ...
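[NOTE] Each "[EXEC] [ATTEMPT 0/3]" line in this log comes from a retry wrapper in .github/scripts/setup_env.bash, whose source is not shown here. A hypothetical equivalent, assuming a name exec_with_retries, zero-based attempt numbering, and a fixed delay (all assumptions, not confirmed by the log):

  exec_with_retries () {
    local max=3 delay=5 attempt
    for attempt in $(seq 0 "$max"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0        # stop at the first success
      sleep "$delay"          # back off before retrying
    done
    echo "[EXEC] command failed after ${max} retries: $*" >&2
    return 1
  }

  # e.g. the base update that follows next in this log:
  exec_with_retries conda update -n base -c defaults --update-deps -y conda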
2025-05-07T20:23:46.9524129Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.8897304Z Channels:
2025-05-07T20:23:47.8897551Z - defaults
2025-05-07T20:23:47.8898035Z Platform: linux-64
2025-05-07T20:23:49.1364225Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.2538179Z Solving environment: done
2025-05-07T20:23:49.5510753Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7666579Z Solving environment: done
2025-05-07T20:23:49.9161946Z ## Package Plan ##
2025-05-07T20:23:49.9162326Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9162810Z added / updated specs:
2025-05-07T20:23:49.9163137Z - conda
2025-05-07T20:23:49.9163606Z The following packages will be downloaded:
2025-05-07T20:23:49.9164045Z package | build
2025-05-07T20:23:49.9164473Z ---------------------------|-----------------
2025-05-07T20:23:49.9164883Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.9165383Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.9165936Z ------------------------------------------------------------
2025-05-07T20:23:49.9166755Z Total: 1.4 MB
2025-05-07T20:23:49.9167120Z The following packages will be UPDATED:
2025-05-07T20:23:49.9167636Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9168162Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9168575Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:50.2868904Z Preparing transaction: done
2025-05-07T20:23:50.3874928Z Verifying transaction: done
2025-05-07T20:23:52.5965187Z Executing transaction: done
2025-05-07T20:23:53.2259495Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.2263352Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.2231088Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.2231447Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.2910318Z + conda clean --all -y
2025-05-07T20:23:54.8314164Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.8314513Z Will remove 1 index cache(s).
2025-05-07T20:23:54.8314829Z There are no unused package(s) to remove.
2025-05-07T20:23:54.8315178Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.8315462Z There are no logfile(s) to remove. 2025-05-07T20:23:54.9004786Z 2025-05-07T20:23:54.9009865Z + conda info 2025-05-07T20:23:54.9010050Z 2025-05-07T20:23:55.6807975Z 2025-05-07T20:23:55.6808573Z active environment : base 2025-05-07T20:23:55.6808945Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.6809305Z shell level : 1 2025-05-07T20:23:55.6809593Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.6809987Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.6810415Z conda version : 25.3.1 2025-05-07T20:23:55.6810715Z conda-build version : not installed 2025-05-07T20:23:55.6811018Z python version : 3.13.2.final.0 2025-05-07T20:23:55.6811349Z solver : libmamba (default) 2025-05-07T20:23:55.6811670Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.6811968Z __conda=25.3.1=0 2025-05-07T20:23:55.6812253Z __cuda=12.8=0 2025-05-07T20:23:55.6812530Z __glibc=2.34=0 2025-05-07T20:23:55.6812805Z __linux=6.1.130=0 2025-05-07T20:23:55.6813089Z __unix=0=0 2025-05-07T20:23:55.6813445Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.6813855Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.6814218Z conda av metadata url : None 2025-05-07T20:23:55.6815036Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.6815482Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.6815875Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.6816282Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.6816661Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.6817007Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.6817362Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.6817714Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.6818031Z platform : linux-64 2025-05-07T20:23:55.6818873Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.6819706Z UID:GID : 1000:1000 2025-05-07T20:23:55.6820005Z netrc file : None 2025-05-07T20:23:55.6820268Z offline mode : False 2025-05-07T20:23:55.6820455Z 2025-05-07T20:23:55.7619215Z 2025-05-07T20:23:55.7619636Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.7621051Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_91bf4d2a-ac78-4419-8bdb-b54df260a85a ... 2025-05-07T20:23:55.7622156Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.7707503Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7707995Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7725310Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.7725667Z env: 2025-05-07T20:23:55.7725895Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.7726199Z BUILD_ENV: build_binary 2025-05-07T20:23:55.7726477Z BUILD_TARGET: genai 2025-05-07T20:23:55.7726720Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.7726954Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.7727208Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.7727511Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.7727854Z ##[endgroup] 2025-05-07T20:23:56.1103681Z ################################################################################ 2025-05-07T20:23:56.1104044Z # Create Conda Environment 2025-05-07T20:23:56.1104291Z # 2025-05-07T20:23:56.1121005Z # [2025-05-07T20:23:56.111Z] + create_conda_environment build_binary 3.11 2025-05-07T20:23:56.1121482Z ################################################################################ 2025-05-07T20:23:56.1121693Z 2025-05-07T20:23:56.1137782Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.2037029Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.2037396Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.2037703Z + conda info --envs 2025-05-07T20:23:56.2037845Z 2025-05-07T20:23:56.9792711Z 2025-05-07T20:23:56.9793256Z # conda environments: 2025-05-07T20:23:56.9793520Z # 2025-05-07T20:23:56.9793739Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.9794002Z 2025-05-07T20:23:57.0514319Z 2025-05-07T20:23:57.0514874Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.7186595Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.7186881Z 2025-05-07T20:23:58.7203032Z 2025-05-07T20:23:58.7212478Z [SETUP] Creating new Conda environment (Python 3.11) ... 
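[NOTE] The environment creation step shown here is idempotent: any stale prefix is deleted first, then the environment is recreated with a pinned interpreter, as the conda create command below shows. A minimal sketch under those assumptions; the helper name create_conda_environment matches the invocation echoed above, but its body here is a guess, not the actual setup_env.bash source:

  create_conda_environment () {
    local env_name="$1" python_version="$2"
    # Remove any stale prefix so the create starts clean
    rm -rf "$(conda info --base)/envs/${env_name}"
    # Pin only the Python version; everything else resolves freely
    conda create -y -n "${env_name}" python="${python_version}"
  }

  create_conda_environment build_binary 3.11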
2025-05-07T20:23:58.7234619Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11
2025-05-07T20:23:59.5024475Z Channels:
2025-05-07T20:23:59.5024786Z - defaults
2025-05-07T20:23:59.5025072Z Platform: linux-64
2025-05-07T20:24:01.0573880Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.1579444Z Solving environment: done
2025-05-07T20:24:01.1870099Z ## Package Plan ##
2025-05-07T20:24:01.1870704Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.1871303Z added / updated specs:
2025-05-07T20:24:01.1871644Z - python=3.11
2025-05-07T20:24:01.1871908Z The following packages will be downloaded:
2025-05-07T20:24:01.1872259Z package | build
2025-05-07T20:24:01.1872587Z ---------------------------|-----------------
2025-05-07T20:24:01.1872947Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:01.1873342Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:01.1873903Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:01.1874479Z python-3.11.11 | he870216_0 32.9 MB
2025-05-07T20:24:01.1875003Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB
2025-05-07T20:24:01.1875417Z wheel-0.45.1 | py311h06a4308_0 151 KB
2025-05-07T20:24:01.1875782Z ------------------------------------------------------------
2025-05-07T20:24:01.1876450Z Total: 35.4 MB
2025-05-07T20:24:01.1876786Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.1877412Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.1877855Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.1878331Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.1878804Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.1879344Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.1879803Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.1880233Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.1880763Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.1881422Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.1882048Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.1882477Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.1882888Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.1883297Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.1883838Z python pkgs/main/linux-64::python-3.11.11-he870216_0
2025-05-07T20:24:01.1884263Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.1884735Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0
2025-05-07T20:24:01.1885200Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.1885587Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.1885971Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.1886388Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0
2025-05-07T20:24:01.1886783Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.1887159Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.1887558Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:02.5788301Z Preparing transaction: done
2025-05-07T20:24:03.8266799Z Verifying transaction: done
2025-05-07T20:24:06.1481742Z Executing transaction: done
2025-05-07T20:24:06.1985649Z #
2025-05-07T20:24:06.1986133Z # To activate this environment, use
2025-05-07T20:24:06.1986697Z #
2025-05-07T20:24:06.1987088Z #     $ conda activate build_binary
2025-05-07T20:24:06.1987606Z #
2025-05-07T20:24:06.1988025Z # To deactivate an active environment, use
2025-05-07T20:24:06.1988591Z #
2025-05-07T20:24:06.1988963Z #     $ conda deactivate
2025-05-07T20:24:06.3172848Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.3196105Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.2268904Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:09.2269503Z Collecting pip
2025-05-07T20:24:09.2269832Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.2270247Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.2271106Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 126.9 MB/s eta 0:00:00
2025-05-07T20:24:09.2271475Z Installing collected packages: pip
2025-05-07T20:24:09.2271766Z Attempting uninstall: pip
2025-05-07T20:24:09.2272061Z Found existing installation: pip 25.1
2025-05-07T20:24:09.2272375Z Uninstalling pip-25.1:
2025-05-07T20:24:09.2272654Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.2272982Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.2942674Z [SETUP] Upgrading pyOpenSSL ...
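[NOTE] Upgrading pip via `conda run -n build_binary pip install --upgrade pip` (above) keeps the upgrade inside the environment without activating it in the CI shell; the pyOpenSSL upgrade whose command follows pins a lower bound rather than an exact version. A sketch of both, under the same environment name as this log:

  # Run pip inside the env; no `conda activate` is needed in the CI shell
  conda run -n build_binary pip install --upgrade pip
  # Quote the spec so `>` is not treated as a shell redirection
  conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"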
2025-05-07T20:24:09.2965437Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:10.1585252Z Channels:
2025-05-07T20:24:10.1585786Z - conda-forge
2025-05-07T20:24:10.1586234Z Platform: linux-64
2025-05-07T20:24:20.7473163Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.4359261Z Solving environment: done
2025-05-07T20:24:22.5007821Z ## Package Plan ##
2025-05-07T20:24:22.5008324Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:22.5008778Z added / updated specs:
2025-05-07T20:24:22.5009162Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:22.5009637Z The following packages will be downloaded:
2025-05-07T20:24:22.5010079Z package | build
2025-05-07T20:24:22.5010463Z ---------------------------|-----------------
2025-05-07T20:24:22.5010838Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge
2025-05-07T20:24:22.5011291Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge
2025-05-07T20:24:22.5011727Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:22.5012146Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:22.5012563Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:22.5012976Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:22.5013390Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:22.5013835Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:22.5014267Z python_abi-3.11 | 2_cp311 5 KB conda-forge
2025-05-07T20:24:22.5014720Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:22.5015209Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:22.5015633Z ------------------------------------------------------------
2025-05-07T20:24:22.5015977Z Total: 6.4 MB
2025-05-07T20:24:22.5016315Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:22.5016746Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0
2025-05-07T20:24:22.5017244Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0
2025-05-07T20:24:22.5017743Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:22.5018508Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:22.5018984Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:22.5019595Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311
2025-05-07T20:24:22.5020314Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:22.5020900Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:22.5021356Z The following packages will be UPDATED:
2025-05-07T20:24:22.5021951Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:22.5022718Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:22.5023368Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:22.5024009Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 -->
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.5024532Z Downloading and Extracting Packages: ...working... done
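[NOTE] Once the install completes below, the script import-tests pyOpenSSL, then installs libxcrypt and copies crypt.h into the Python include directory (crypt.h was dropped from recent glibc toolchains, while CPython 3.11 headers can still expect it). A minimal sketch of those checks, assuming the environment layout shown in this log:

  # Verify the OpenSSL package is importable inside the env
  conda run -n build_binary python -c "import OpenSSL" \
    && echo "[CHECK] Python (sub-)package 'OpenSSL' found ..."
  # Provide crypt.h via libxcrypt, then expose it to the Python headers
  conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
  env_prefix="$HOME/miniconda/envs/build_binary"
  cp "${env_prefix}/include/crypt.h" "${env_prefix}/include/python3.11/crypt.h"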
2025-05-07T20:24:23.1404090Z Preparing transaction: done
2025-05-07T20:24:23.2409227Z Verifying transaction: done
2025-05-07T20:24:24.7437088Z Executing transaction: done
2025-05-07T20:24:24.9207573Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.6752709Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.6766186Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.6790206Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.5484048Z Channels:
2025-05-07T20:24:27.5484370Z - conda-forge
2025-05-07T20:24:27.5484695Z Platform: linux-64
2025-05-07T20:24:30.9328510Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.3073043Z Solving environment: done
2025-05-07T20:24:31.3687510Z ## Package Plan ##
2025-05-07T20:24:31.3688019Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.3688567Z added / updated specs:
2025-05-07T20:24:31.3688927Z - libxcrypt
2025-05-07T20:24:31.3689185Z The following packages will be downloaded:
2025-05-07T20:24:31.3689531Z package | build
2025-05-07T20:24:31.3689862Z ---------------------------|-----------------
2025-05-07T20:24:31.3690242Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:31.3690647Z ------------------------------------------------------------
2025-05-07T20:24:31.3690994Z Total: 98 KB
2025-05-07T20:24:31.3691360Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.3691818Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.3692259Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:31.6536294Z Preparing transaction: done
2025-05-07T20:24:31.7537880Z Verifying transaction: done
2025-05-07T20:24:31.8543338Z Executing transaction: done
2025-05-07T20:24:35.3283068Z [SETUP] Copying over ...
2025-05-07T20:24:35.3283932Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h
2025-05-07T20:24:37.0193365Z [SETUP] Installed Python version: Python 3.11.11
2025-05-07T20:24:37.0193824Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:37.0227792Z ##[group]Run . 
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.0228250Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.0241023Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:37.0241369Z env:
2025-05-07T20:24:37.0241592Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:37.0241881Z BUILD_ENV: build_binary
2025-05-07T20:24:37.0242122Z BUILD_TARGET: genai
2025-05-07T20:24:37.0242347Z BUILD_VARIANT: cuda
2025-05-07T20:24:37.0242573Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:37.0243014Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:37.0243307Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:37.0243840Z ##[endgroup]
2025-05-07T20:24:37.3641578Z ################################################################################
2025-05-07T20:24:37.3641974Z # Install C/C++ Compilers
2025-05-07T20:24:37.3642224Z #
2025-05-07T20:24:37.3658600Z # [2025-05-07T20:24:37.365Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:37.3659008Z ################################################################################
2025-05-07T20:24:37.3673728Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.4559871Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.4569342Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:37.4589788Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:38.3268373Z Channels:
2025-05-07T20:24:38.3268635Z - conda-forge
2025-05-07T20:24:38.3268853Z Platform: linux-64
2025-05-07T20:24:41.7230868Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.0949865Z Solving environment: done
2025-05-07T20:24:42.1568787Z ## Package Plan ##
2025-05-07T20:24:42.1569189Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.1569582Z added / updated specs:
2025-05-07T20:24:42.1569844Z - sysroot_linux-64=2.17
2025-05-07T20:24:42.1570133Z The following packages will be downloaded:
2025-05-07T20:24:42.1570464Z package | build
2025-05-07T20:24:42.1570775Z ---------------------------|-----------------
2025-05-07T20:24:42.1571189Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:42.1571687Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:42.1572087Z ------------------------------------------------------------
2025-05-07T20:24:42.1572424Z Total: 15.4 MB
2025-05-07T20:24:42.1572761Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.1573270Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:42.1573819Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:42.1574282Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:43.2790118Z done
2025-05-07T20:24:43.3793090Z Preparing transaction: done
2025-05-07T20:24:43.5797243Z Verifying transaction: done
2025-05-07T20:24:43.7854591Z Executing transaction: done
2025-05-07T20:24:43.9499720Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.9500035Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:45.6498674Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:45.6514098Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:45.6537722Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:46.5443532Z Channels:
2025-05-07T20:24:46.5443940Z - conda-forge
2025-05-07T20:24:46.5444236Z Platform: linux-64
2025-05-07T20:24:49.9082362Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:50.8899798Z Solving environment: done
2025-05-07T20:24:50.9550573Z ## Package Plan ##
2025-05-07T20:24:50.9551078Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:50.9551623Z added / updated specs:
2025-05-07T20:24:50.9551880Z - gxx_linux-64=11.4.0
2025-05-07T20:24:50.9552250Z The following packages will be downloaded:
2025-05-07T20:24:50.9552703Z package | build
2025-05-07T20:24:50.9553143Z ---------------------------|-----------------
2025-05-07T20:24:50.9553687Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:50.9554217Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:50.9554876Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:50.9555380Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:50.9555822Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:50.9556258Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:50.9556691Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:50.9557151Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:50.9557617Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:50.9558056Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:50.9558527Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:50.9559010Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:50.9559420Z ------------------------------------------------------------ 2025-05-07T20:24:50.9559915Z Total: 91.6 MB 2025-05-07T20:24:50.9560202Z 2025-05-07T20:24:50.9560385Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:50.9561010Z 2025-05-07T20:24:50.9561309Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:50.9561873Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:50.9562418Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:50.9562941Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:50.9563450Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:50.9564116Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:50.9564869Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.9565438Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:50.9565934Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:50.9566489Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.9566856Z 2025-05-07T20:24:50.9566973Z The following packages will be UPDATED: 2025-05-07T20:24:50.9567177Z 2025-05-07T20:24:50.9567497Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:50.9568214Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:50.9568620Z 2025-05-07T20:24:50.9568625Z 2025-05-07T20:24:50.9568629Z 2025-05-07T20:24:50.9568773Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:50.9569165Z [progress-bar output elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), and binutils_linux-64 (28 KB) each downloaded to 100%]
2025-05-07T20:24:53.3149796Z done
2025-05-07T20:24:53.4150240Z Preparing transaction: done
2025-05-07T20:24:53.7160653Z Verifying transaction: done
2025-05-07T20:24:53.8170626Z Executing transaction: done
2025-05-07T20:24:53.9871636Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.9302634Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.9335138Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.9364641Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.9395110Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.8374513Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.9046725Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:01.8009971Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.8691808Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:03.7681949Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.8336229Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:05.7274706Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:05.7917812Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:05.7921786Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:05.7922437Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:05.7922782Z 2025-05-07T20:25:07.7021200Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7021631Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.7022034Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.7022394Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7022833Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.7023568Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.7023866Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.7024251Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.7024653Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.7024995Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.7025322Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.7025627Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.7025893Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.7026172Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.7026446Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.7026793Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7027471Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.7027836Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.7028168Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.7028493Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.7028902Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.7029324Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.7029643Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.7029933Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.7030181Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7030472Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.7030747Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7031026Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.7031364Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7031692Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.7031966Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7032251Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.7032519Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.7032778Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.7033044Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.7033302Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.7033568Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.7033819Z #define __INT8_C(c) c 2025-05-07T20:25:07.7034059Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.7034358Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7034673Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.7034991Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.7035361Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7035635Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.7035906Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7036185Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.7036456Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.7036923Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.7037339Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.7037630Z #define __linux 1 2025-05-07T20:25:07.7037859Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.7038139Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:07.7038736Z #define __unix 1 2025-05-07T20:25:07.7039037Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.7039377Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7039654Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.7039929Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7040230Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.7040509Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.7040775Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.7041029Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.7041315Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.7041629Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.7041891Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.7042195Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.7042465Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.7042809Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.7043187Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.7043786Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.7044054Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.7044285Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.7044592Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.7044939Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.7045192Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.7045518Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.7045852Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.7046107Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.7046373Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.7046882Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.7047276Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.7047554Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.7047814Z #define __unix__ 1 2025-05-07T20:25:07.7048031Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.7048278Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.7048529Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.7048773Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:07.7049044Z #define __UINT16_C(c) c 2025-05-07T20:25:07.7049283Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.7049536Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.7049894Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.7050255Z #define __gnu_linux__ 1 2025-05-07T20:25:07.7050492Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.7050769Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7051053Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7051328Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.7051586Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.7051842Z #define __GNUC__ 11 2025-05-07T20:25:07.7052062Z #define __pie__ 2 2025-05-07T20:25:07.7052274Z #define __MMX__ 1 2025-05-07T20:25:07.7052498Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.7052766Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.7053053Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.7053326Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.7053669Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.7054064Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7054381Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.7054647Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.7054909Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.7055209Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.7055478Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.7055736Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.7056027Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.7056321Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.7056589Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.7056871Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.7057124Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.7057401Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.7057695Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.7057949Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.7058210Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.7058533Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.7058892Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.7067570Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.7067874Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.7068185Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7068476Z #define __amd64 1 2025-05-07T20:25:07.7068714Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.7068999Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.7069301Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.7069617Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.7069865Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7070125Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.7070365Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.7070789Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.7071060Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.7071319Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.7071587Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.7071874Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.7072120Z #define __x86_64 1 2025-05-07T20:25:07.7072355Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.7072730Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.7073184Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.7073728Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.7074194Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.7074583Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.7074833Z #define __LP64__ 1 2025-05-07T20:25:07.7075067Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7075425Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.7075792Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.7076068Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7076346Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7076621Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.7076898Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.7077166Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.7077422Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.7077690Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.7077950Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.7078288Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.7078640Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.7078917Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.7079155Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.7079388Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.7079720Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.7080064Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.7080315Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.7080578Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.7080834Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.7081081Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.7081336Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.7081633Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.7081916Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.7082189Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.7082497Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.7082832Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.7083093Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.7083356Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.7083759Z #define __INT32_C(c) c 2025-05-07T20:25:07.7084015Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.7084294Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.7084578Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.7084849Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.7085166Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.7085476Z #define unix 1 2025-05-07T20:25:07.7085696Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.7086009Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7086313Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.7086613Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.7086944Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.7087198Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.7087468Z #define __ELF__ 1 2025-05-07T20:25:07.7087692Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.7087980Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.7088256Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.7088500Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.7088856Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.7089331Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.7089584Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.7089812Z #define __k8 1 2025-05-07T20:25:07.7090110Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.7090477Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.7090770Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.7091070Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.7091322Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.7091565Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.7091817Z #define __x86_64__ 1 2025-05-07T20:25:07.7092198Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.7092488Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.7092821Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7093124Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.7093395Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7093747Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.7094064Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7094319Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.7094596Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7094893Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.7095245Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:07.7095638Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.7095931Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.7096268Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.7096583Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.7096893Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.7097176Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.7097473Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.7097761Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.7098001Z #define __SEG_FS 1 2025-05-07T20:25:07.7098226Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.7098512Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.7098791Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7099071Z #define __SEG_GS 1 2025-05-07T20:25:07.7099383Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:07.7099773Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.7100087Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.7100366Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.7100644Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.7100940Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:07.7101207Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.7101459Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.7101717Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.7102049Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.7102430Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7102720Z #define linux 1 2025-05-07T20:25:07.7102940Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7103216Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.7103489Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.7103730Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.7103993Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.7104256Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.7104600Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.7105001Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.7105330Z #define __code_model_small__ 1 2025-05-07T20:25:07.7105607Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.7105894Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.7106143Z #define __k8__ 1 2025-05-07T20:25:07.7106372Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.7106651Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.7106949Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.7107191Z #define __pic__ 2 2025-05-07T20:25:07.7107546Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7107858Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.7108148Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7108470Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.7108835Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.7109191Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.7109463Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.7109750Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.7110058Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.7110397Z #define __linux__ 1 2025-05-07T20:25:07.7110619Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.7110889Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.7111152Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.7111420Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.7111682Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.7111977Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7112310Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.7112610Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.7112886Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.7113185Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:07.7113471Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.7113806Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.7114166Z #define __SSE__ 1 2025-05-07T20:25:07.7114387Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.7114731Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.7115087Z #define __amd64__ 1 2025-05-07T20:25:07.7115305Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.7115559Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.7115833Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.7116093Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.7116362Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.7116642Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.7116896Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.7117166Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.7117431Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.7117775Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.7118229Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.7118585Z #define _LP64 1 2025-05-07T20:25:07.7118799Z #define __UINT8_C(c) c 2025-05-07T20:25:07.7119035Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.7119296Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.7119561Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.7119832Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.7120127Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.7120470Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.7120929Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.7121304Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7121588Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.7121896Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.7122259Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.7122624Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.7122885Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.7123221Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.7123723Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.7124000Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.7124259Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.7124507Z #define __FXSR__ 1 2025-05-07T20:25:07.7124798Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.7125249Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.7125659Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.7126063Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.7126318Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.7126656Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.7127010Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.7127249Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:07.7127487Z #define __PIC__ 2 2025-05-07T20:25:07.7127737Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.7128125Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.7128509Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.7128938Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.7129263Z #define __SSE2__ 1 2025-05-07T20:25:07.7129485Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.7129735Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.7129983Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.7130317Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.7130686Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.7130958Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.7131218Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.7131485Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7131771Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.7132009Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.7132270Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.7132557Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7132846Z #define __PIE__ 2 2025-05-07T20:25:07.7133176Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.7133572Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.7133905Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.7134269Z #define __INT16_C(c) c 2025-05-07T20:25:07.7134494Z #define __STDC__ 1 2025-05-07T20:25:07.7134716Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.7134998Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.7135254Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.7135549Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.7135888Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.7136227Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.7136489Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.7136760Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.7137024Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.7137307Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.7137588Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.7137862Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.7138156Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.7139134Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.7139509Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.7139807Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.7140104Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.7140351Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.7140516Z 2025-05-07T20:25:07.7668336Z 2025-05-07T20:25:07.7669028Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:07.7669504Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:07.7669735Z 2025-05-07T20:25:09.6736975Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:09.6737457Z #define __cpp_attributes 200809L 2025-05-07T20:25:09.6737913Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:09.6738596Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:09.6738926Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:09.6739178Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:09.6739511Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:09.6739985Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:09.6740410Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:09.6740840Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:09.6741648Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:09.6741925Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:09.6742169Z #define __CHAR_BIT__ 8 2025-05-07T20:25:09.6742409Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:09.6742653Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:09.6742900Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:09.6743175Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:09.6743450Z #define __cpp_static_assert 201411L 2025-05-07T20:25:09.6743734Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:09.6744123Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6744574Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:09.6744862Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:09.6745178Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:09.6745494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:09.6745889Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:09.6746297Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:09.6746605Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:09.6746881Z #define __GCC_IEC_559 2 2025-05-07T20:25:09.6747123Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:09.6747397Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:09.6747664Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:09.6747946Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:09.6748236Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:09.6748552Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:09.6748859Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:09.6749189Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6749510Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:09.6749773Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.6750039Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:09.6750320Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:09.6750615Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:09.6750881Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:09.6751143Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:09.6751412Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:09.6751736Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:09.6752065Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:09.6752319Z #define __INT8_C(c) c 2025-05-07T20:25:09.6752550Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:09.6752817Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:09.6753139Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.6753463Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:09.6753742Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:09.6754031Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:09.6754343Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.6754689Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:09.6754975Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:09.6755259Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.6755515Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.6755792Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:09.6756068Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:09.6756455Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:09.6756865Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:09.6757154Z #define __linux 1 2025-05-07T20:25:09.6757381Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:09.6757655Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:09.6757936Z #define __unix 1 2025-05-07T20:25:09.6758163Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:09.6758436Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:09.6758724Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:09.6758993Z #define __WINT_MIN__ 0U 2025-05-07T20:25:09.6759228Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.6759510Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:09.6759871Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:09.6760134Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:09.6760386Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:09.6760701Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:09.6760997Z #define __INT64_C(c) c ## L 2025-05-07T20:25:09.6761263Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:09.6761563Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:09.6761835Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:09.6762127Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:09.6762401Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:09.6762743Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:09.6763082Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:09.6763457Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:09.6763864Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:09.6764132Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:09.6764406Z #define __DBL_DIG__ 15 2025-05-07T20:25:09.6764643Z #define __FLT32_DIG__ 6 2025-05-07T20:25:09.6764940Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:09.6765284Z #define __GXX_WEAK__ 1 2025-05-07T20:25:09.6765518Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:09.6765769Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:09.6766103Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:09.6766450Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.6766710Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:09.6767003Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:09.6767342Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:09.6767745Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:09.6768135Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:09.6768411Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:09.6768668Z #define __unix__ 1 2025-05-07T20:25:09.6768886Z #define __INT_WIDTH__ 32 2025-05-07T20:25:09.6769136Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:09.6769383Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:09.6769632Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:09.6769899Z #define __UINT16_C(c) c 2025-05-07T20:25:09.6770139Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:09.6770393Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:09.6770812Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:09.6771165Z #define __gnu_linux__ 1 2025-05-07T20:25:09.6771408Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:09.6771672Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:09.6771948Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.6772234Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.6772501Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:09.6772764Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:09.6773008Z #define __GNUC__ 11 2025-05-07T20:25:09.6773227Z #define __GXX_RTTI 1 2025-05-07T20:25:09.6773450Z #define __pie__ 2 2025-05-07T20:25:09.6781716Z #define __MMX__ 1 2025-05-07T20:25:09.6781964Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:09.6782243Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:09.6782531Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:09.6782800Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:09.6783056Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:09.6783364Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:09.6783678Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:09.6784028Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.6784400Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:09.6784710Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.6785030Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.6785296Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:09.6785556Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:09.6785868Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:09.6786170Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:09.6786552Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:09.6786811Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:09.6787096Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:09.6787388Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:09.6787651Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:09.6787929Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:09.6788181Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:09.6788435Z #define __cplusplus 201703L 2025-05-07T20:25:09.6788703Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:09.6788984Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:09.6789313Z #define __DEPRECATED 1 2025-05-07T20:25:09.6789568Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:09.6789861Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:09.6790111Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:09.6790429Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.6790787Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:09.6791062Z #define __SSE2_MATH__ 1 2025-05-07T20:25:09.6791302Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:09.6791601Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.6791893Z #define __amd64 1 2025-05-07T20:25:09.6792107Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:09.6792373Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:09.6792634Z #define __GNUG__ 11 2025-05-07T20:25:09.6792882Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:09.6793195Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:09.6793451Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:09.6793700Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:09.6793983Z [... remainder of the predefined-macro dump from `c++ -dM -E` elided: GCC 11.4.0 targeting x86_64 Linux (__VERSION__ "11.4.0", __GNUC_MINOR__ 4, __x86_64__ 1, __linux__ 1, __ELF__ 1), C++17 feature-test macros (e.g. __cpp_constexpr 201603L, __cpp_deduction_guides 201703L), plus the usual type-width, endianness, and floating-point limit macros ...]
2025-05-07T20:25:09.7393770Z + conda run -n build_binary c++ --version
2025-05-07T20:25:11.6334822Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:11.6335287Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:11.6335748Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:25:11.6336286Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:11.6970656Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:11.6971320Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:13.6598114Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:13.6601298Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:13.6602016Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:15.6325033Z #define __cplusplus 201703L
2025-05-07T20:25:15.6328188Z [INSTALL] Successfully installed C/C++ compilers
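The three compiler probes above are easy to reproduce outside CI. A minimal sketch, assuming only a conda environment named build_binary with the conda-forge compilers installed (as this job sets up):

    #!/usr/bin/env bash
    # Sketch of the compiler introspection steps performed above; assumes a
    # conda environment named "build_binary" providing the cc/c++ drivers.
    set -euo pipefail

    env_name=build_binary

    # Compiler identification banner.
    conda run -n "${env_name}" c++ --version

    # Default C standard: dump the predefined macros for an empty translation
    # unit and read __STDC_VERSION__ (201710L corresponds to C17).
    conda run -n "${env_name}" cc -dM -E - < /dev/null | grep __STDC_VERSION__

    # Default C++ standard: same probe via __cplusplus (201703L is C++17).
    conda run -n "${env_name}" c++ -dM -E -x c++ - < /dev/null | grep __cplusplus

The -dM -E trick asks the preprocessor to dump its predefined macros for an empty input, so the reported standards reflect the compiler's defaults rather than any project-specific flags.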
2025-05-07T20:25:15.6365173Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:15.6365599Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:15.6385437Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:15.6385932Z env:
2025-05-07T20:25:15.6386152Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:15.6386446Z   BUILD_ENV: build_binary
2025-05-07T20:25:15.6386674Z   BUILD_TARGET: genai
2025-05-07T20:25:15.6386896Z   BUILD_VARIANT: cuda
2025-05-07T20:25:15.6387332Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:15.6387571Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:15.6387869Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:15.6388199Z ##[endgroup]
2025-05-07T20:25:15.9797496Z ################################################################################
2025-05-07T20:25:15.9797836Z # Install CUDA
2025-05-07T20:25:15.9798042Z #
2025-05-07T20:25:15.9812696Z # [2025-05-07T20:25:15.980Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.9813184Z ################################################################################
2025-05-07T20:25:15.9827820Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:16.0709025Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:16.0709490Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:16.0713676Z + conda clean --packages --tarball -y
2025-05-07T20:25:16.7893092Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:16.7893595Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:16.8559855Z + conda clean --all -y
2025-05-07T20:25:17.5282945Z There are no unused tarball(s) to remove.
2025-05-07T20:25:17.5283336Z Will remove 1 index cache(s).
2025-05-07T20:25:17.5283845Z There are no unused package(s) to remove.
2025-05-07T20:25:17.5284198Z There are no tempfile(s) to remove.
2025-05-07T20:25:17.5284515Z There are no logfile(s) to remove.
2025-05-07T20:25:17.5971478Z [INSTALL] Installing CUDA 12.6.3 ...
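Judging from the [EXEC]/[ATTEMPT] lines that follow, the install_cuda helper cleans the conda caches and then retries a pinned conda-forge install. The real implementation lives in .github/scripts/setup_env.bash and is not shown in this log, so the following is only a hypothetical sketch reconstructed from the log lines (the retry count and back-off are assumptions):

    #!/usr/bin/env bash
    # Hypothetical sketch of an install_cuda-style helper; reconstructed from
    # the log, not the actual .github/scripts/setup_env.bash implementation.
    set -euo pipefail

    env_name="${1:-build_binary}"     # conda environment name
    cuda_version="${2:-12.6.3}"       # CUDA version to pin

    # Free disk space before pulling ~1.6 GB of CUDA packages.
    conda clean --packages --tarball -y
    conda clean --all -y

    # Retry the install, mirroring the [ATTEMPT n/3] numbering in the log.
    ok=0
    for attempt in 0 1 2 3; do
      echo "[EXEC] [ATTEMPT ${attempt}/3] + conda install cuda=${cuda_version}"
      if conda install --force-reinstall -n "${env_name}" \
           -c conda-forge --override-channels -y "cuda=${cuda_version}"; then
        ok=1
        break
      fi
      sleep 30  # assumed back-off between attempts
    done
    [[ "${ok}" -eq 1 ]] || { echo "[ERROR] CUDA ${cuda_version} install failed"; exit 1; }

Pinning cuda=12.6.3 with --override-channels keeps the solver from mixing pkgs/main builds into the toolchain, which matches the package plan below, where python, sqlite, and tk are superseded by conda-forge builds.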
2025-05-07T20:25:17.5994646Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:18.5074401Z Channels:
2025-05-07T20:25:18.5074676Z  - conda-forge
2025-05-07T20:25:18.5074896Z Platform: linux-64
2025-05-07T20:25:29.2683460Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:30.3774494Z Solving environment: done
2025-05-07T20:25:30.4510943Z ## Package Plan ##
2025-05-07T20:25:30.4511313Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:30.4511715Z   added / updated specs:
2025-05-07T20:25:30.4511960Z     - cuda=12.6.3
2025-05-07T20:25:30.4512258Z The following packages will be downloaded:
2025-05-07T20:25:30.4512750Z [... per-package download-size table elided (~120 conda-forge packages; the largest are nsight-compute 443.1 MB, libcublas 256.2 MB, libcufft 156.2 MB, libcusparse 118.6 MB, cuda-nsight 113.2 MB, and cuda-nvvp 109.3 MB) ...]
2025-05-07T20:25:30.4581164Z                                            Total:        1.64 GB
2025-05-07T20:25:30.4581508Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:30.4581948Z   alsa-lib           conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0
2025-05-07T20:25:30.4582374Z   attr               conda-forge/linux-64::attr-2.5.1-h166bdaf_1
2025-05-07T20:25:30.4582800Z   binutils           conda-forge/linux-64::binutils-2.40-h4852527_7
2025-05-07T20:25:30.4583255Z   c-compiler         conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0
2025-05-07T20:25:30.4583698Z   cuda               conda-forge/noarch::cuda-12.6.3-ha804496_0
2025-05-07T20:25:30.4584174Z   cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0
2025-05-07T20:25:30.4584770Z   cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0
2025-05-07T20:25:30.4585345Z   cuda-compiler      conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0
2025-05-07T20:25:30.4585898Z   cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0
2025-05-07T20:25:30.4586458Z   cuda-crt-tools     conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0
2025-05-07T20:25:30.4586976Z   cuda-cudart        conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0
2025-05-07T20:25:30.4587498Z   cuda-cudart-dev    conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0
2025-05-07T20:25:30.4588164Z   cuda-cudart-dev_l~
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4588772Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:30.4592015Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4592627Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4593191Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4593809Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:30.4594318Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:30.4594839Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4595379Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4595962Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:30.4596484Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:30.4596975Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:30.4597540Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:30.4598087Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:30.4598561Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:30.4599094Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:30.4599655Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:30.4600205Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:30.4600763Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:30.4601307Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4601831Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4602344Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:30.4602846Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4603356Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:30.4603974Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:30.4604470Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4604991Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:30.4605555Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:30.4606107Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:30.4606621Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:30.4607096Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4607622Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4608197Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:30.4608743Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:30.4609294Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:30.4609847Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:30.4610333Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:30.4610803Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:30.4611441Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:30.4611992Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:30.4612445Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:30.4612843Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:30.4613360Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:30.4613970Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:30.4614672Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:30.4615242Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:30.4615746Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:30.4616243Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:30.4616739Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:30.4617202Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:30.4617626Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:30.4618051Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:30.4618475Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:30.4618848Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:30.4619268Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:30.4619685Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:30.4620086Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:30.4620531Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:30.4621048Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:30.4621545Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:30.4622025Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:30.4622523Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:30.4623032Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:30.4623537Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:30.4624049Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:30.4624568Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:30.4625105Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:30.4625647Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:30.4626181Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:30.4626704Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:30.4627165Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:30.4627641Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:30.4628139Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:30.4628656Z libgcrypt-lib 
conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:30.4629152Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:30.4629611Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:30.4630083Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:30.4630517Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:30.4631040Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:30.4631505Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:30.4631966Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:30.4632394Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:30.4632864Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:30.4633389Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:30.4634199Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:30.4634807Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:30.4635343Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:30.4635853Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:30.4636353Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:30.4636807Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:30.4637282Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:30.4637740Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:30.4638182Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:30.4638954Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:30.4639458Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:30.4639924Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:30.4640360Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:30.4640786Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:30.4641279Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:30.4641771Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:30.4642168Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:30.4642578Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:30.4643073Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:30.4643644Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:30.4644153Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:30.4644654Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:30.4645096Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:30.4645542Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:30.4646042Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:30.4646572Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:30.4647115Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:30.4647697Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:30.4648257Z xcb-util-wm 
conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:25:30.4648770Z   xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:25:30.4649304Z   xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:25:30.4649789Z   xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:25:30.4650274Z   xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:25:30.4650753Z   xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:25:30.4651498Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:25:30.4652090Z   xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:25:30.4652621Z   xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:25:30.4662283Z   xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:25:30.4662810Z   xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:25:30.4663317Z   xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:25:30.4664135Z   xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:25:30.4664806Z   xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:25:30.4665337Z   xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:25:30.4665790Z   zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:25:30.4666158Z The following packages will be UPDATED:
2025-05-07T20:25:30.4666642Z   libuuid  pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:30.4667234Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:30.4667785Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:30.4668388Z   python   pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:30.4669024Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:30.4669603Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:30.4670113Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:30.4670495Z [... interleaved progress-bar redraws elided; the archives download in parallel, led by nsight-compute (443.1 MB), libcublas (256.2 MB), and libcufft (156.2 MB) ...]
2025-05-07T20:25:33.2479015Z nsight-compute-2024.
| 443.1 MB | ## | 21% 2025-05-07T20:25:33.2479556Z 2025-05-07T20:25:33.2479560Z 2025-05-07T20:25:33.2480233Z 2025-05-07T20:25:33.2681406Z libcusparse-12.5.4.2 | 118.6 MB | ########4 | 85%  2025-05-07T20:25:33.2681680Z 2025-05-07T20:25:33.2681684Z 2025-05-07T20:25:33.2681688Z 2025-05-07T20:25:33.2684224Z 2025-05-07T20:25:33.2809890Z cuda-nsight-12.6.77 | 113.2 MB | #########1 | 92%  2025-05-07T20:25:33.2810166Z 2025-05-07T20:25:33.3390469Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 38%  2025-05-07T20:25:33.3411409Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:33.3411650Z 2025-05-07T20:25:33.3413714Z 2025-05-07T20:25:33.3611029Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:33.3611284Z 2025-05-07T20:25:33.3611525Z 2025-05-07T20:25:33.3611540Z 2025-05-07T20:25:33.3811106Z libcusparse-12.5.4.2 | 118.6 MB | ########7 | 88%  2025-05-07T20:25:33.3811415Z 2025-05-07T20:25:33.3811421Z 2025-05-07T20:25:33.3811462Z 2025-05-07T20:25:33.3814488Z 2025-05-07T20:25:33.3843501Z cuda-nsight-12.6.77 | 113.2 MB | #########5 | 95%  2025-05-07T20:25:33.3843920Z 2025-05-07T20:25:33.4417036Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 40%  2025-05-07T20:25:33.4538117Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:33.4538518Z 2025-05-07T20:25:33.4541019Z 2025-05-07T20:25:33.4705705Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 74%  2025-05-07T20:25:33.4705968Z 2025-05-07T20:25:33.4705979Z 2025-05-07T20:25:33.4705984Z 2025-05-07T20:25:33.4845979Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 90%  2025-05-07T20:25:33.4846252Z 2025-05-07T20:25:33.4891822Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 41%  2025-05-07T20:25:33.4892074Z 2025-05-07T20:25:33.4892078Z 2025-05-07T20:25:33.4892082Z 2025-05-07T20:25:33.4893586Z 2025-05-07T20:25:33.5462443Z cuda-nsight-12.6.77 | 113.2 MB | #########8 | 98%  2025-05-07T20:25:33.5586080Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:33.5586321Z 2025-05-07T20:25:33.5586331Z 2025-05-07T20:25:33.5838316Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:33.5838790Z 2025-05-07T20:25:33.5838794Z 2025-05-07T20:25:33.5839918Z 2025-05-07T20:25:33.5852968Z libcusparse-12.5.4.2 | 118.6 MB | #########3 | 93%  2025-05-07T20:25:33.5853245Z 2025-05-07T20:25:33.6466702Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 42%  2025-05-07T20:25:33.6601040Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:33.6601292Z 2025-05-07T20:25:33.6602699Z 2025-05-07T20:25:33.6845482Z libcufft-11.3.0.4 | 156.2 MB | #######7 | 78%  2025-05-07T20:25:33.6845742Z 2025-05-07T20:25:33.6845746Z 2025-05-07T20:25:33.6845750Z 2025-05-07T20:25:33.6856312Z libcusparse-12.5.4.2 | 118.6 MB | #########5 | 96%  2025-05-07T20:25:33.6856580Z 2025-05-07T20:25:33.7551647Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:33.7604099Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:33.7604354Z 2025-05-07T20:25:33.7606102Z 2025-05-07T20:25:33.7884103Z libcufft-11.3.0.4 | 156.2 MB | ######## | 80%  2025-05-07T20:25:33.7884366Z 2025-05-07T20:25:33.7884370Z 2025-05-07T20:25:33.7888142Z 2025-05-07T20:25:33.8000455Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:33.8004856Z 2025-05-07T20:25:33.8605613Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 45%  2025-05-07T20:25:33.8605868Z 2025-05-07T20:25:33.8607459Z 2025-05-07T20:25:33.8610273Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:33.9087526Z nsight-compute-2024. 
| 443.1 MB | ##5 | 25% 2025-05-07T20:25:33.9087778Z 2025-05-07T20:25:33.9606089Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:33.9606390Z 2025-05-07T20:25:33.9609611Z 2025-05-07T20:25:33.9623074Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 84%  2025-05-07T20:25:34.0106705Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:34.0108920Z 2025-05-07T20:25:34.0607696Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 48%  2025-05-07T20:25:34.0607951Z 2025-05-07T20:25:34.0610933Z 2025-05-07T20:25:34.0624770Z libcufft-11.3.0.4 | 156.2 MB | ########6 | 87%  2025-05-07T20:25:34.1107044Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:34.1108090Z 2025-05-07T20:25:34.1613974Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:34.1614242Z 2025-05-07T20:25:34.1614883Z 2025-05-07T20:25:34.1627122Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:34.2107187Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:34.2108160Z 2025-05-07T20:25:34.2614594Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:25:34.2614846Z 2025-05-07T20:25:34.2616144Z 2025-05-07T20:25:34.2629805Z libcufft-11.3.0.4 | 156.2 MB | #########1 | 92%  2025-05-07T20:25:34.3108292Z nsight-compute-2024. | 443.1 MB | ##8 | 28% 2025-05-07T20:25:34.3109727Z 2025-05-07T20:25:34.3630366Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:34.3630627Z 2025-05-07T20:25:34.3631345Z 2025-05-07T20:25:34.3702381Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 94%  2025-05-07T20:25:34.4185569Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:34.4187312Z 2025-05-07T20:25:34.4660002Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 54%  2025-05-07T20:25:34.4660254Z 2025-05-07T20:25:34.4660344Z 2025-05-07T20:25:34.4739829Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 97%  2025-05-07T20:25:34.5208179Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:34.5208635Z 2025-05-07T20:25:34.5686313Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:34.5686574Z 2025-05-07T20:25:34.5687102Z 2025-05-07T20:25:34.5746777Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 99%  2025-05-07T20:25:34.6208344Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:34.6210114Z 2025-05-07T20:25:34.6754727Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:34.7208642Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:34.7209039Z 2025-05-07T20:25:34.7754579Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 59%  2025-05-07T20:25:34.8209182Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:34.8209542Z 2025-05-07T20:25:34.9149870Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:34.9210397Z nsight-compute-2024. | 443.1 MB | ###3 | 33% 2025-05-07T20:25:34.9212123Z 2025-05-07T20:25:35.0153194Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:25:35.0223920Z nsight-compute-2024. | 443.1 MB | ###4 | 34% 2025-05-07T20:25:35.0225331Z 2025-05-07T20:25:35.1224106Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:35.1225122Z 2025-05-07T20:25:35.1464094Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:35.2326684Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:35.2326990Z 2025-05-07T20:25:35.2464854Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:35.3399634Z nsight-compute-2024. | 443.1 MB | ###6 | 36% 2025-05-07T20:25:35.3399943Z 2025-05-07T20:25:35.3465971Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:35.4401298Z nsight-compute-2024. 
| 443.1 MB | ###7 | 37% 2025-05-07T20:25:35.4402211Z 2025-05-07T20:25:35.4467080Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:25:35.5467166Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:35.5613440Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:35.5614284Z 2025-05-07T20:25:35.6473004Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:35.6613684Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:35.6614461Z 2025-05-07T20:25:35.7440625Z libcublas-12.6.4.1 | 256.2 MB | ######## | 80%  2025-05-07T20:25:35.7440959Z 2025-05-07T20:25:35.7440963Z 2025-05-07T20:25:35.7440967Z 2025-05-07T20:25:35.7440971Z 2025-05-07T20:25:35.7614397Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:35.7615192Z 2025-05-07T20:25:35.7685254Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:35.7839966Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:35.7840313Z 2025-05-07T20:25:35.7840317Z 2025-05-07T20:25:35.7840321Z 2025-05-07T20:25:35.7840324Z 2025-05-07T20:25:35.7841891Z 2025-05-07T20:25:35.8827066Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:35.8839783Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:25:35.8840156Z 2025-05-07T20:25:35.8840164Z 2025-05-07T20:25:35.8840196Z 2025-05-07T20:25:35.8840201Z 2025-05-07T20:25:35.8840211Z 2025-05-07T20:25:35.9120779Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 4%  2025-05-07T20:25:35.9121156Z 2025-05-07T20:25:35.9844639Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:35.9844915Z 2025-05-07T20:25:35.9844919Z 2025-05-07T20:25:35.9844923Z 2025-05-07T20:25:35.9844927Z 2025-05-07T20:25:35.9847342Z 2025-05-07T20:25:35.9870365Z cuda-nvvp-12.6.80 | 109.3 MB | 7 | 7%  2025-05-07T20:25:36.0601236Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:25:36.0601569Z 2025-05-07T20:25:36.0845320Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:36.0845580Z 2025-05-07T20:25:36.0845597Z 2025-05-07T20:25:36.0845601Z 2025-05-07T20:25:36.0845605Z 2025-05-07T20:25:36.0847408Z 2025-05-07T20:25:36.0908128Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:25:36.1671713Z nsight-compute-2024. | 443.1 MB | ####4 | 44% 2025-05-07T20:25:36.1672267Z 2025-05-07T20:25:36.1672275Z 2025-05-07T20:25:36.1672283Z 2025-05-07T20:25:36.1846883Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:36.1847244Z 2025-05-07T20:25:36.1847248Z 2025-05-07T20:25:36.1847252Z 2025-05-07T20:25:36.1847255Z 2025-05-07T20:25:36.1847259Z 2025-05-07T20:25:36.1949923Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 13%  2025-05-07T20:25:36.1950281Z 2025-05-07T20:25:36.2019195Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:25:36.2181264Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:25:36.2181576Z 2025-05-07T20:25:36.2181580Z 2025-05-07T20:25:36.2181584Z 2025-05-07T20:25:36.2181587Z 2025-05-07T20:25:36.2181591Z 2025-05-07T20:25:36.2189983Z 2025-05-07T20:25:36.3062574Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:36.3062883Z 2025-05-07T20:25:36.3062888Z 2025-05-07T20:25:36.3062904Z 2025-05-07T20:25:36.3062908Z 2025-05-07T20:25:36.3070336Z 2025-05-07T20:25:36.3187951Z cuda-nvvp-12.6.80 | 109.3 MB | #6 | 17%  2025-05-07T20:25:36.3188244Z 2025-05-07T20:25:36.3188248Z 2025-05-07T20:25:36.3188252Z 2025-05-07T20:25:36.3188256Z 2025-05-07T20:25:36.3188259Z 2025-05-07T20:25:36.3192794Z 2025-05-07T20:25:36.3403555Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:36.3503318Z nsight-compute-2024. 
| 443.1 MB | ####6 | 46% 2025-05-07T20:25:36.3503594Z 2025-05-07T20:25:36.4188297Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:36.4188578Z 2025-05-07T20:25:36.4188583Z 2025-05-07T20:25:36.4188586Z 2025-05-07T20:25:36.4188590Z 2025-05-07T20:25:36.4188594Z 2025-05-07T20:25:36.4190393Z 2025-05-07T20:25:36.4226105Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:36.4226407Z 2025-05-07T20:25:36.4226410Z 2025-05-07T20:25:36.4226651Z 2025-05-07T20:25:36.4226657Z 2025-05-07T20:25:36.4228407Z 2025-05-07T20:25:36.4796366Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 19%  2025-05-07T20:25:36.5136902Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:36.5137243Z 2025-05-07T20:25:36.5188180Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:36.5188510Z 2025-05-07T20:25:36.5188516Z 2025-05-07T20:25:36.5188521Z 2025-05-07T20:25:36.5188527Z 2025-05-07T20:25:36.5188551Z 2025-05-07T20:25:36.5188556Z 2025-05-07T20:25:36.5470701Z libcusolver-11.7.1.2 | 95.8 MB | 7 | 7%  2025-05-07T20:25:36.5470999Z 2025-05-07T20:25:36.5471003Z 2025-05-07T20:25:36.5471007Z 2025-05-07T20:25:36.5471011Z 2025-05-07T20:25:36.5477206Z 2025-05-07T20:25:36.6044439Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 22%  2025-05-07T20:25:36.6198219Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:36.6198580Z 2025-05-07T20:25:36.6198607Z 2025-05-07T20:25:36.6198611Z 2025-05-07T20:25:36.6198615Z 2025-05-07T20:25:36.6198619Z 2025-05-07T20:25:36.6200619Z 2025-05-07T20:25:36.6475217Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 10%  2025-05-07T20:25:36.6475513Z 2025-05-07T20:25:36.6475517Z 2025-05-07T20:25:36.6475521Z 2025-05-07T20:25:36.6475525Z 2025-05-07T20:25:36.6475529Z 2025-05-07T20:25:36.6579236Z cuda-nvvp-12.6.80 | 109.3 MB | ##4 | 25%  2025-05-07T20:25:36.6579558Z 2025-05-07T20:25:36.7140892Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:36.7200446Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:36.7200747Z 2025-05-07T20:25:36.7200752Z 2025-05-07T20:25:36.7200755Z 2025-05-07T20:25:36.7200759Z 2025-05-07T20:25:36.7200763Z 2025-05-07T20:25:36.7202267Z 2025-05-07T20:25:36.7478826Z libcusolver-11.7.1.2 | 95.8 MB | #2 | 12%  2025-05-07T20:25:36.7479123Z 2025-05-07T20:25:36.7479128Z 2025-05-07T20:25:36.7479149Z 2025-05-07T20:25:36.7479153Z 2025-05-07T20:25:36.7479262Z 2025-05-07T20:25:36.7931006Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 27%  2025-05-07T20:25:36.7931366Z 2025-05-07T20:25:36.8140738Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:36.8206432Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:36.8206715Z 2025-05-07T20:25:36.8206719Z 2025-05-07T20:25:36.8206723Z 2025-05-07T20:25:36.8206744Z 2025-05-07T20:25:36.8206748Z 2025-05-07T20:25:36.8209215Z 2025-05-07T20:25:36.8481429Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 15%  2025-05-07T20:25:36.8481724Z 2025-05-07T20:25:36.8481728Z 2025-05-07T20:25:36.8481732Z 2025-05-07T20:25:36.8481737Z 2025-05-07T20:25:36.8483983Z 2025-05-07T20:25:36.9185554Z cuda-nvvp-12.6.80 | 109.3 MB | ### | 30%  2025-05-07T20:25:36.9211303Z nsight-compute-2024. 
| 443.1 MB | ####9 | 50% 2025-05-07T20:25:36.9211590Z 2025-05-07T20:25:36.9211613Z 2025-05-07T20:25:36.9211617Z 2025-05-07T20:25:36.9211621Z 2025-05-07T20:25:36.9211625Z 2025-05-07T20:25:36.9213776Z 2025-05-07T20:25:36.9227446Z libcusolver-11.7.1.2 | 95.8 MB | #7 | 17%  2025-05-07T20:25:36.9227779Z 2025-05-07T20:25:36.9482467Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:36.9482811Z 2025-05-07T20:25:36.9482817Z 2025-05-07T20:25:36.9482822Z 2025-05-07T20:25:36.9482827Z 2025-05-07T20:25:36.9485741Z 2025-05-07T20:25:37.0186553Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:37.0212413Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:25:37.0212749Z 2025-05-07T20:25:37.0212753Z 2025-05-07T20:25:37.0212757Z 2025-05-07T20:25:37.0212761Z 2025-05-07T20:25:37.0212765Z 2025-05-07T20:25:37.0216079Z 2025-05-07T20:25:37.0408562Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:37.0413091Z 2025-05-07T20:25:37.0524843Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:37.0525134Z 2025-05-07T20:25:37.0525138Z 2025-05-07T20:25:37.0525141Z 2025-05-07T20:25:37.0525145Z 2025-05-07T20:25:37.0526660Z 2025-05-07T20:25:37.1213760Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 36%  2025-05-07T20:25:37.1214125Z 2025-05-07T20:25:37.1214129Z 2025-05-07T20:25:37.1214133Z 2025-05-07T20:25:37.1214137Z 2025-05-07T20:25:37.1214141Z 2025-05-07T20:25:37.1214161Z 2025-05-07T20:25:37.1341434Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 22%  2025-05-07T20:25:37.1408474Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:37.1408827Z 2025-05-07T20:25:37.1526758Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:37.1527050Z 2025-05-07T20:25:37.1527054Z 2025-05-07T20:25:37.1527058Z 2025-05-07T20:25:37.1527061Z 2025-05-07T20:25:37.1528979Z 2025-05-07T20:25:37.2216761Z cuda-nvvp-12.6.80 | 109.3 MB | ###8 | 39%  2025-05-07T20:25:37.2217068Z 2025-05-07T20:25:37.2217072Z 2025-05-07T20:25:37.2217076Z 2025-05-07T20:25:37.2217079Z 2025-05-07T20:25:37.2217083Z 2025-05-07T20:25:37.2221097Z 2025-05-07T20:25:37.2343115Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:25:37.2521586Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:25:37.2524078Z 2025-05-07T20:25:37.2532213Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:37.2532470Z 2025-05-07T20:25:37.2532474Z 2025-05-07T20:25:37.2532478Z 2025-05-07T20:25:37.2532482Z 2025-05-07T20:25:37.2534063Z 2025-05-07T20:25:37.3222153Z cuda-nvvp-12.6.80 | 109.3 MB | ####1 | 41%  2025-05-07T20:25:37.3222443Z 2025-05-07T20:25:37.3222447Z 2025-05-07T20:25:37.3222451Z 2025-05-07T20:25:37.3222455Z 2025-05-07T20:25:37.3222459Z 2025-05-07T20:25:37.3227129Z 2025-05-07T20:25:37.3390054Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 27%  2025-05-07T20:25:37.3552662Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:25:37.3553017Z 2025-05-07T20:25:37.3553028Z 2025-05-07T20:25:37.3553172Z 2025-05-07T20:25:37.3553179Z 2025-05-07T20:25:37.3553189Z 2025-05-07T20:25:37.3671321Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 44%  2025-05-07T20:25:37.3671699Z 2025-05-07T20:25:37.4225622Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:37.4225992Z 2025-05-07T20:25:37.4226019Z 2025-05-07T20:25:37.4226030Z 2025-05-07T20:25:37.4226045Z 2025-05-07T20:25:37.4226052Z 2025-05-07T20:25:37.4227408Z 2025-05-07T20:25:37.4392878Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 30%  2025-05-07T20:25:37.4553077Z nsight-compute-2024. 
| 443.1 MB | #####3 | 53% 2025-05-07T20:25:37.4553412Z 2025-05-07T20:25:37.4553419Z 2025-05-07T20:25:37.4553425Z 2025-05-07T20:25:37.4553431Z 2025-05-07T20:25:37.4553457Z 2025-05-07T20:25:37.5232761Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 47%  2025-05-07T20:25:37.5233174Z 2025-05-07T20:25:37.5233180Z 2025-05-07T20:25:37.5233185Z 2025-05-07T20:25:37.5233190Z 2025-05-07T20:25:37.5233196Z 2025-05-07T20:25:37.5233201Z 2025-05-07T20:25:37.5395073Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 33%  2025-05-07T20:25:37.5556791Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:25:37.5557149Z 2025-05-07T20:25:37.5557155Z 2025-05-07T20:25:37.5557409Z 2025-05-07T20:25:37.5557416Z 2025-05-07T20:25:37.5557421Z 2025-05-07T20:25:37.6237198Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 50%  2025-05-07T20:25:37.6237579Z 2025-05-07T20:25:37.6237585Z 2025-05-07T20:25:37.6237591Z 2025-05-07T20:25:37.6237596Z 2025-05-07T20:25:37.6237603Z 2025-05-07T20:25:37.6238961Z 2025-05-07T20:25:37.6395999Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:37.6558327Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:37.6558949Z 2025-05-07T20:25:37.6558955Z 2025-05-07T20:25:37.6558960Z 2025-05-07T20:25:37.6558965Z 2025-05-07T20:25:37.6558975Z 2025-05-07T20:25:37.7240166Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 53%  2025-05-07T20:25:37.7240569Z 2025-05-07T20:25:37.7240574Z 2025-05-07T20:25:37.7240580Z 2025-05-07T20:25:37.7240585Z 2025-05-07T20:25:37.7240590Z 2025-05-07T20:25:37.7240595Z 2025-05-07T20:25:37.7396922Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 39%  2025-05-07T20:25:37.7564083Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:37.7564444Z 2025-05-07T20:25:37.7564537Z 2025-05-07T20:25:37.7564542Z 2025-05-07T20:25:37.7564548Z 2025-05-07T20:25:37.7568386Z 2025-05-07T20:25:37.8245598Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 56%  2025-05-07T20:25:37.8246009Z 2025-05-07T20:25:37.8246016Z 2025-05-07T20:25:37.8246021Z 2025-05-07T20:25:37.8246026Z 2025-05-07T20:25:37.8246031Z 2025-05-07T20:25:37.8246056Z 2025-05-07T20:25:37.8400000Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:25:37.8565678Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:37.8566035Z 2025-05-07T20:25:37.8566040Z 2025-05-07T20:25:37.8566046Z 2025-05-07T20:25:37.8566051Z 2025-05-07T20:25:37.8568429Z 2025-05-07T20:25:37.9247004Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 59%  2025-05-07T20:25:37.9247415Z 2025-05-07T20:25:37.9247436Z 2025-05-07T20:25:37.9247441Z 2025-05-07T20:25:37.9247446Z 2025-05-07T20:25:37.9247452Z 2025-05-07T20:25:37.9247457Z 2025-05-07T20:25:37.9401294Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:37.9612161Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:37.9612500Z 2025-05-07T20:25:37.9612506Z 2025-05-07T20:25:37.9612511Z 2025-05-07T20:25:37.9612517Z 2025-05-07T20:25:37.9614760Z 2025-05-07T20:25:38.0258337Z cuda-nvvp-12.6.80 | 109.3 MB | ######2 | 62%  2025-05-07T20:25:38.0258732Z 2025-05-07T20:25:38.0258738Z 2025-05-07T20:25:38.0258743Z 2025-05-07T20:25:38.0258748Z 2025-05-07T20:25:38.0258753Z 2025-05-07T20:25:38.0265887Z 2025-05-07T20:25:38.0420901Z libcusolver-11.7.1.2 | 95.8 MB | ####7 | 48%  2025-05-07T20:25:38.0663478Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:38.0663825Z 2025-05-07T20:25:38.0663831Z 2025-05-07T20:25:38.0663836Z 2025-05-07T20:25:38.0663855Z 2025-05-07T20:25:38.0668703Z 2025-05-07T20:25:38.1277296Z cuda-nvvp-12.6.80 | 109.3 MB | ######4 | 65%  2025-05-07T20:25:38.1277684Z 2025-05-07T20:25:38.1277690Z 2025-05-07T20:25:38.1277695Z 2025-05-07T20:25:38.1277701Z 2025-05-07T20:25:38.1277706Z 2025-05-07T20:25:38.1277711Z 2025-05-07T20:25:38.1421572Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 51%  2025-05-07T20:25:38.1712915Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:38.1713308Z 2025-05-07T20:25:38.1713313Z 2025-05-07T20:25:38.1713319Z 2025-05-07T20:25:38.1713324Z 2025-05-07T20:25:38.1717102Z 2025-05-07T20:25:38.2319352Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 68%  2025-05-07T20:25:38.2328697Z 2025-05-07T20:25:38.2328704Z 2025-05-07T20:25:38.2328710Z 2025-05-07T20:25:38.2328715Z 2025-05-07T20:25:38.2328720Z 2025-05-07T20:25:38.2328725Z 2025-05-07T20:25:38.2430324Z libcusolver-11.7.1.2 | 95.8 MB | #####3 | 54%  2025-05-07T20:25:38.2765025Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:38.2765379Z 2025-05-07T20:25:38.2765385Z 2025-05-07T20:25:38.2765390Z 2025-05-07T20:25:38.2765395Z 2025-05-07T20:25:38.2766801Z 2025-05-07T20:25:38.3336204Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 71%  2025-05-07T20:25:38.3336595Z 2025-05-07T20:25:38.3336601Z 2025-05-07T20:25:38.3336611Z 2025-05-07T20:25:38.3336616Z 2025-05-07T20:25:38.3336621Z 2025-05-07T20:25:38.3336627Z 2025-05-07T20:25:38.3452001Z libcusolver-11.7.1.2 | 95.8 MB | #####6 | 56%  2025-05-07T20:25:38.3766254Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:38.3766643Z 2025-05-07T20:25:38.3766918Z 2025-05-07T20:25:38.3766954Z 2025-05-07T20:25:38.3766960Z 2025-05-07T20:25:38.3768637Z 2025-05-07T20:25:38.4337786Z cuda-nvvp-12.6.80 | 109.3 MB | #######3 | 74%  2025-05-07T20:25:38.4338183Z 2025-05-07T20:25:38.4338199Z 2025-05-07T20:25:38.4338223Z 2025-05-07T20:25:38.4338228Z 2025-05-07T20:25:38.4338234Z 2025-05-07T20:25:38.4338239Z 2025-05-07T20:25:38.4563161Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:38.4768753Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:38.4769112Z 2025-05-07T20:25:38.4769222Z 2025-05-07T20:25:38.4769228Z 2025-05-07T20:25:38.4769231Z 2025-05-07T20:25:38.4770502Z 2025-05-07T20:25:38.5340230Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 77%  2025-05-07T20:25:38.5340649Z 2025-05-07T20:25:38.5340654Z 2025-05-07T20:25:38.5340659Z 2025-05-07T20:25:38.5340665Z 2025-05-07T20:25:38.5340670Z 2025-05-07T20:25:38.5340675Z 2025-05-07T20:25:38.5564740Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:38.5783857Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:25:38.5784228Z 2025-05-07T20:25:38.5784234Z 2025-05-07T20:25:38.5784239Z 2025-05-07T20:25:38.5784261Z 2025-05-07T20:25:38.5784267Z 2025-05-07T20:25:38.6438731Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:38.6439143Z 2025-05-07T20:25:38.6439148Z 2025-05-07T20:25:38.6439154Z 2025-05-07T20:25:38.6439159Z 2025-05-07T20:25:38.6439165Z 2025-05-07T20:25:38.6442790Z 2025-05-07T20:25:38.6565951Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:38.6802721Z nsight-compute-2024. 
| 443.1 MB | ######2 | 62% 2025-05-07T20:25:38.6803016Z 2025-05-07T20:25:38.6803234Z 2025-05-07T20:25:38.6803239Z 2025-05-07T20:25:38.6803245Z 2025-05-07T20:25:38.6804891Z 2025-05-07T20:25:38.7444189Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:38.7444590Z 2025-05-07T20:25:38.7444596Z 2025-05-07T20:25:38.7444601Z 2025-05-07T20:25:38.7444607Z 2025-05-07T20:25:38.7444612Z 2025-05-07T20:25:38.7444618Z 2025-05-07T20:25:38.7575826Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 69%  2025-05-07T20:25:38.7805362Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:38.7805670Z 2025-05-07T20:25:38.7805813Z 2025-05-07T20:25:38.7805819Z 2025-05-07T20:25:38.7805824Z 2025-05-07T20:25:38.7807496Z 2025-05-07T20:25:38.8445006Z cuda-nvvp-12.6.80 | 109.3 MB | ########5 | 85%  2025-05-07T20:25:38.8445363Z 2025-05-07T20:25:38.8445383Z 2025-05-07T20:25:38.8445388Z 2025-05-07T20:25:38.8445394Z 2025-05-07T20:25:38.8445400Z 2025-05-07T20:25:38.8447727Z 2025-05-07T20:25:38.8663587Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:38.8806860Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:38.8807192Z 2025-05-07T20:25:38.8807198Z 2025-05-07T20:25:38.8807204Z 2025-05-07T20:25:38.8807209Z 2025-05-07T20:25:38.8810348Z 2025-05-07T20:25:38.8924072Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:25:38.8924357Z 2025-05-07T20:25:38.8931834Z 2025-05-07T20:25:38.9322373Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:38.9322890Z 2025-05-07T20:25:38.9322896Z 2025-05-07T20:25:38.9322902Z 2025-05-07T20:25:38.9322905Z 2025-05-07T20:25:38.9322909Z 2025-05-07T20:25:38.9322913Z 2025-05-07T20:25:38.9324153Z 2025-05-07T20:25:38.9488220Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:38.9488570Z 2025-05-07T20:25:38.9488574Z 2025-05-07T20:25:38.9488578Z 2025-05-07T20:25:38.9488582Z 2025-05-07T20:25:38.9488824Z 2025-05-07T20:25:38.9488837Z 2025-05-07T20:25:38.9935601Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 75%  2025-05-07T20:25:38.9982717Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:38.9982986Z 2025-05-07T20:25:38.9982990Z 2025-05-07T20:25:38.9982994Z 2025-05-07T20:25:38.9982997Z 2025-05-07T20:25:38.9983001Z 2025-05-07T20:25:39.0322954Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:39.0323384Z 2025-05-07T20:25:39.0323391Z 2025-05-07T20:25:39.0323396Z 2025-05-07T20:25:39.0323402Z 2025-05-07T20:25:39.0323407Z 2025-05-07T20:25:39.0323412Z 2025-05-07T20:25:39.0325408Z 2025-05-07T20:25:39.0577318Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:39.0577622Z 2025-05-07T20:25:39.0577633Z 2025-05-07T20:25:39.0577638Z 2025-05-07T20:25:39.0577643Z 2025-05-07T20:25:39.0577648Z 2025-05-07T20:25:39.0581926Z 2025-05-07T20:25:39.1070939Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 78%  2025-05-07T20:25:39.1071415Z 2025-05-07T20:25:39.1071421Z 2025-05-07T20:25:39.1071428Z 2025-05-07T20:25:39.1071433Z 2025-05-07T20:25:39.1075275Z 2025-05-07T20:25:39.1160657Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:25:39.1326560Z nsight-compute-2024. 
| 443.1 MB | ######5 | 65% 2025-05-07T20:25:39.1326861Z 2025-05-07T20:25:39.1327052Z 2025-05-07T20:25:39.1327058Z 2025-05-07T20:25:39.1327062Z 2025-05-07T20:25:39.1327082Z 2025-05-07T20:25:39.1327086Z 2025-05-07T20:25:39.1328340Z 2025-05-07T20:25:39.1752216Z libnpp-12.3.1.54 | 93.4 MB | 5 | 5%  2025-05-07T20:25:39.1752524Z 2025-05-07T20:25:39.1752530Z 2025-05-07T20:25:39.1752534Z 2025-05-07T20:25:39.1752537Z 2025-05-07T20:25:39.1752541Z 2025-05-07T20:25:39.1757173Z 2025-05-07T20:25:39.2165802Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 81%  2025-05-07T20:25:39.2166239Z 2025-05-07T20:25:39.2166275Z 2025-05-07T20:25:39.2166281Z 2025-05-07T20:25:39.2166286Z 2025-05-07T20:25:39.2169480Z 2025-05-07T20:25:39.2298562Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:25:39.2327553Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:39.2327924Z 2025-05-07T20:25:39.2327938Z 2025-05-07T20:25:39.2327942Z 2025-05-07T20:25:39.2327946Z 2025-05-07T20:25:39.2327949Z 2025-05-07T20:25:39.2327953Z 2025-05-07T20:25:39.2327957Z 2025-05-07T20:25:39.2794133Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:39.2794492Z 2025-05-07T20:25:39.2794502Z 2025-05-07T20:25:39.2794506Z 2025-05-07T20:25:39.2794510Z 2025-05-07T20:25:39.2794514Z 2025-05-07T20:25:39.2796464Z 2025-05-07T20:25:39.3249211Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 84%  2025-05-07T20:25:39.3249614Z 2025-05-07T20:25:39.3249619Z 2025-05-07T20:25:39.3249622Z 2025-05-07T20:25:39.3249626Z 2025-05-07T20:25:39.3254188Z 2025-05-07T20:25:39.3334221Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 99%  2025-05-07T20:25:39.3334618Z 2025-05-07T20:25:39.3334624Z 2025-05-07T20:25:39.3334630Z 2025-05-07T20:25:39.3334635Z 2025-05-07T20:25:39.3334640Z 2025-05-07T20:25:39.3334645Z 2025-05-07T20:25:39.3336293Z 2025-05-07T20:25:39.3394313Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:39.3802049Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:39.3803290Z 2025-05-07T20:25:39.3803299Z 2025-05-07T20:25:39.3803304Z 2025-05-07T20:25:39.3803310Z 2025-05-07T20:25:39.3803315Z 2025-05-07T20:25:39.3803324Z 2025-05-07T20:25:39.4332127Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 87%  2025-05-07T20:25:39.4332534Z 2025-05-07T20:25:39.4332539Z 2025-05-07T20:25:39.4332544Z 2025-05-07T20:25:39.4334220Z 2025-05-07T20:25:39.4335143Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:39.4335424Z 2025-05-07T20:25:39.4335662Z 2025-05-07T20:25:39.4335671Z 2025-05-07T20:25:39.4335675Z 2025-05-07T20:25:39.4335679Z 2025-05-07T20:25:39.4335682Z 2025-05-07T20:25:39.4343095Z 2025-05-07T20:25:39.4449959Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:39.4802911Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:39.4803198Z 2025-05-07T20:25:39.4803202Z 2025-05-07T20:25:39.4803205Z 2025-05-07T20:25:39.4803209Z 2025-05-07T20:25:39.4803226Z 2025-05-07T20:25:39.4805962Z 2025-05-07T20:25:39.5339394Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 91%  2025-05-07T20:25:39.5339809Z 2025-05-07T20:25:39.5339814Z 2025-05-07T20:25:39.5339818Z 2025-05-07T20:25:39.5339822Z 2025-05-07T20:25:39.5339826Z 2025-05-07T20:25:39.5339829Z 2025-05-07T20:25:39.5339957Z 2025-05-07T20:25:39.5452727Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:25:39.5802903Z nsight-compute-2024. 
| 443.1 MB | ######8 | 68% 2025-05-07T20:25:39.5803180Z 2025-05-07T20:25:39.5803184Z 2025-05-07T20:25:39.5803188Z 2025-05-07T20:25:39.5803191Z 2025-05-07T20:25:39.5803195Z 2025-05-07T20:25:39.5808956Z 2025-05-07T20:25:39.6343008Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 94%  2025-05-07T20:25:39.6343308Z 2025-05-07T20:25:39.6343312Z 2025-05-07T20:25:39.6343316Z 2025-05-07T20:25:39.6343320Z 2025-05-07T20:25:39.6343324Z 2025-05-07T20:25:39.6343328Z 2025-05-07T20:25:39.6344822Z 2025-05-07T20:25:39.6454005Z libnpp-12.3.1.54 | 93.4 MB | ## | 21%  2025-05-07T20:25:39.6907269Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:39.6907524Z 2025-05-07T20:25:39.6907529Z 2025-05-07T20:25:39.6907541Z 2025-05-07T20:25:39.6907545Z 2025-05-07T20:25:39.6907549Z 2025-05-07T20:25:39.6909917Z 2025-05-07T20:25:39.7346327Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 97%  2025-05-07T20:25:39.7346734Z 2025-05-07T20:25:39.7346751Z 2025-05-07T20:25:39.7346755Z 2025-05-07T20:25:39.7346759Z 2025-05-07T20:25:39.7346763Z 2025-05-07T20:25:39.7346767Z 2025-05-07T20:25:39.7349651Z 2025-05-07T20:25:39.7505193Z libnpp-12.3.1.54 | 93.4 MB | ##4 | 24%  2025-05-07T20:25:39.8347410Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:39.8347704Z 2025-05-07T20:25:39.8347710Z 2025-05-07T20:25:39.8347715Z 2025-05-07T20:25:39.8347737Z 2025-05-07T20:25:39.8347754Z 2025-05-07T20:25:39.8347760Z 2025-05-07T20:25:39.8350796Z 2025-05-07T20:25:39.8506109Z libnpp-12.3.1.54 | 93.4 MB | ##7 | 28%  2025-05-07T20:25:39.9348981Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:39.9349326Z 2025-05-07T20:25:39.9349330Z 2025-05-07T20:25:39.9349334Z 2025-05-07T20:25:39.9349337Z 2025-05-07T20:25:39.9349341Z 2025-05-07T20:25:39.9349352Z 2025-05-07T20:25:39.9352610Z 2025-05-07T20:25:39.9524703Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 31%  2025-05-07T20:25:40.0353884Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:40.0354245Z 2025-05-07T20:25:40.0354252Z 2025-05-07T20:25:40.0354257Z 2025-05-07T20:25:40.0354262Z 2025-05-07T20:25:40.0354267Z 2025-05-07T20:25:40.0354282Z 2025-05-07T20:25:40.0355857Z 2025-05-07T20:25:40.0936951Z libnpp-12.3.1.54 | 93.4 MB | ###5 | 35%  2025-05-07T20:25:40.1356171Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:40.1356522Z 2025-05-07T20:25:40.1356528Z 2025-05-07T20:25:40.1356533Z 2025-05-07T20:25:40.1356538Z 2025-05-07T20:25:40.1356543Z 2025-05-07T20:25:40.1356548Z 2025-05-07T20:25:40.1358247Z 2025-05-07T20:25:40.2358049Z libnpp-12.3.1.54 | 93.4 MB | ###9 | 40%  2025-05-07T20:25:40.2358438Z 2025-05-07T20:25:40.2358443Z 2025-05-07T20:25:40.2358449Z 2025-05-07T20:25:40.2358454Z 2025-05-07T20:25:40.2358469Z 2025-05-07T20:25:40.2358475Z 2025-05-07T20:25:40.2358909Z 2025-05-07T20:25:40.2406398Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:40.3362144Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:40.3362514Z 2025-05-07T20:25:40.3362519Z 2025-05-07T20:25:40.3362525Z 2025-05-07T20:25:40.3362530Z 2025-05-07T20:25:40.3362535Z 2025-05-07T20:25:40.3362540Z 2025-05-07T20:25:40.3364015Z 2025-05-07T20:25:40.4363532Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:25:40.4364066Z 2025-05-07T20:25:40.4364070Z 2025-05-07T20:25:40.4364074Z 2025-05-07T20:25:40.4364078Z 2025-05-07T20:25:40.4364089Z 2025-05-07T20:25:40.4364093Z 2025-05-07T20:25:40.4370127Z 2025-05-07T20:25:40.4424209Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 53%  2025-05-07T20:25:40.5428810Z nsight-compute-2024. 
| 443.1 MB | #######2 | 73% 2025-05-07T20:25:40.5504473Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:40.5504840Z 2025-05-07T20:25:40.5504878Z 2025-05-07T20:25:40.5504884Z 2025-05-07T20:25:40.5504889Z 2025-05-07T20:25:40.5504894Z 2025-05-07T20:25:40.5504899Z 2025-05-07T20:25:40.5504904Z 2025-05-07T20:25:40.6433004Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:25:40.6667493Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:40.6667902Z 2025-05-07T20:25:40.6667908Z 2025-05-07T20:25:40.6667914Z 2025-05-07T20:25:40.6667935Z 2025-05-07T20:25:40.6667973Z 2025-05-07T20:25:40.6667977Z 2025-05-07T20:25:40.6669253Z 2025-05-07T20:25:40.7673174Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 61%  2025-05-07T20:25:40.7673565Z 2025-05-07T20:25:40.7673569Z 2025-05-07T20:25:40.7673573Z 2025-05-07T20:25:40.7673577Z 2025-05-07T20:25:40.7673581Z 2025-05-07T20:25:40.7673585Z 2025-05-07T20:25:40.7674790Z 2025-05-07T20:25:40.7749596Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 66%  2025-05-07T20:25:40.8710585Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:40.8710962Z 2025-05-07T20:25:40.8710968Z 2025-05-07T20:25:40.8710973Z 2025-05-07T20:25:40.8710989Z 2025-05-07T20:25:40.8710994Z 2025-05-07T20:25:40.8710999Z 2025-05-07T20:25:40.8711005Z 2025-05-07T20:25:40.8753953Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 70%  2025-05-07T20:25:40.9710778Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:25:40.9711056Z 2025-05-07T20:25:40.9711080Z 2025-05-07T20:25:40.9711084Z 2025-05-07T20:25:40.9711087Z 2025-05-07T20:25:40.9711091Z 2025-05-07T20:25:40.9711096Z 2025-05-07T20:25:40.9711100Z 2025-05-07T20:25:41.0246560Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 74%  2025-05-07T20:25:41.0792679Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:41.0792948Z 2025-05-07T20:25:41.0792953Z 2025-05-07T20:25:41.0792956Z 2025-05-07T20:25:41.0792960Z 2025-05-07T20:25:41.0792964Z 2025-05-07T20:25:41.0792994Z 2025-05-07T20:25:41.0797872Z 2025-05-07T20:25:41.1796023Z libnpp-12.3.1.54 | 93.4 MB | #######8 | 79%  2025-05-07T20:25:41.1796471Z 2025-05-07T20:25:41.1796488Z 2025-05-07T20:25:41.1796495Z 2025-05-07T20:25:41.1796501Z 2025-05-07T20:25:41.1796508Z 2025-05-07T20:25:41.1796515Z 2025-05-07T20:25:41.1796520Z 2025-05-07T20:25:41.2797645Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 83%  2025-05-07T20:25:41.2797967Z 2025-05-07T20:25:41.2798248Z 2025-05-07T20:25:41.2798253Z 2025-05-07T20:25:41.2798257Z 2025-05-07T20:25:41.2798261Z 2025-05-07T20:25:41.2798264Z 2025-05-07T20:25:41.2798297Z 2025-05-07T20:25:41.3804703Z libnpp-12.3.1.54 | 93.4 MB | ########8 | 89%  2025-05-07T20:25:41.3805082Z 2025-05-07T20:25:41.3805091Z 2025-05-07T20:25:41.3805098Z 2025-05-07T20:25:41.3805108Z 2025-05-07T20:25:41.3805132Z 2025-05-07T20:25:41.3805138Z 2025-05-07T20:25:41.3805144Z 2025-05-07T20:25:41.4120964Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:41.5011942Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:41.5012232Z 2025-05-07T20:25:41.5012237Z 2025-05-07T20:25:41.5012240Z 2025-05-07T20:25:41.5012244Z 2025-05-07T20:25:41.5012248Z 2025-05-07T20:25:41.5012253Z 2025-05-07T20:25:41.5012447Z 2025-05-07T20:25:41.5597027Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:25:41.6600816Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:41.7603570Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:41.8604642Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:41.9604825Z nsight-compute-2024. 
| 443.1 MB | #######9 | 80% 2025-05-07T20:25:42.0607119Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:42.1613435Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:42.2626085Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:42.3666882Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:42.4681254Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:42.5389158Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:42.5389538Z 2025-05-07T20:25:42.5389543Z 2025-05-07T20:25:42.5389549Z 2025-05-07T20:25:42.5389554Z 2025-05-07T20:25:42.5389570Z 2025-05-07T20:25:42.5398735Z 2025-05-07T20:25:42.5732736Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:42.6029454Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:42.6029824Z 2025-05-07T20:25:42.6029830Z 2025-05-07T20:25:42.6029836Z 2025-05-07T20:25:42.6029840Z 2025-05-07T20:25:42.6029847Z 2025-05-07T20:25:42.6029852Z 2025-05-07T20:25:42.6029857Z 2025-05-07T20:25:42.6031151Z 2025-05-07T20:25:42.6120618Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:42.6120936Z 2025-05-07T20:25:42.6120966Z 2025-05-07T20:25:42.6120970Z 2025-05-07T20:25:42.6120974Z 2025-05-07T20:25:42.6125323Z 2025-05-07T20:25:42.6631645Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:42.6631944Z 2025-05-07T20:25:42.6631949Z 2025-05-07T20:25:42.6631952Z 2025-05-07T20:25:42.6631956Z 2025-05-07T20:25:42.6631960Z 2025-05-07T20:25:42.6631964Z 2025-05-07T20:25:42.6631968Z 2025-05-07T20:25:42.6631972Z 2025-05-07T20:25:42.6632715Z 2025-05-07T20:25:42.6893363Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:42.7031933Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:42.7032324Z 2025-05-07T20:25:42.7032330Z 2025-05-07T20:25:42.7032335Z 2025-05-07T20:25:42.7032340Z 2025-05-07T20:25:42.7032346Z 2025-05-07T20:25:42.7032351Z 2025-05-07T20:25:42.7032356Z 2025-05-07T20:25:42.7039893Z 2025-05-07T20:25:42.7631418Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:25:42.7631799Z 2025-05-07T20:25:42.7631805Z 2025-05-07T20:25:42.7631811Z 2025-05-07T20:25:42.7631816Z 2025-05-07T20:25:42.7631821Z 2025-05-07T20:25:42.7631826Z 2025-05-07T20:25:42.7631831Z 2025-05-07T20:25:42.7631836Z 2025-05-07T20:25:42.7633363Z 2025-05-07T20:25:42.8125211Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:25:42.8125536Z 2025-05-07T20:25:42.8125540Z 2025-05-07T20:25:42.8125544Z 2025-05-07T20:25:42.8125557Z 2025-05-07T20:25:42.8125881Z 2025-05-07T20:25:42.8125890Z 2025-05-07T20:25:42.8125895Z 2025-05-07T20:25:42.8125901Z 2025-05-07T20:25:42.8275872Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:25:42.8671720Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:42.8672071Z 2025-05-07T20:25:42.8672076Z 2025-05-07T20:25:42.8672081Z 2025-05-07T20:25:42.8672087Z 2025-05-07T20:25:42.8672092Z 2025-05-07T20:25:42.8672097Z 2025-05-07T20:25:42.8672110Z 2025-05-07T20:25:42.8672428Z 2025-05-07T20:25:42.8673122Z 2025-05-07T20:25:42.9197772Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:25:42.9198180Z 2025-05-07T20:25:42.9198198Z 2025-05-07T20:25:42.9198203Z 2025-05-07T20:25:42.9198208Z 2025-05-07T20:25:42.9198213Z 2025-05-07T20:25:42.9198218Z 2025-05-07T20:25:42.9198223Z 2025-05-07T20:25:42.9200676Z 2025-05-07T20:25:42.9437036Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:42.9671845Z nsight-compute-2024. 
| 443.1 MB | ########8 | 88% 2025-05-07T20:25:42.9672130Z 2025-05-07T20:25:42.9672134Z 2025-05-07T20:25:42.9672138Z 2025-05-07T20:25:42.9672142Z 2025-05-07T20:25:42.9672153Z 2025-05-07T20:25:42.9672157Z 2025-05-07T20:25:42.9672160Z 2025-05-07T20:25:42.9672164Z 2025-05-07T20:25:42.9673472Z 2025-05-07T20:25:43.0260569Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:25:43.0260914Z 2025-05-07T20:25:43.0260919Z 2025-05-07T20:25:43.0260954Z 2025-05-07T20:25:43.0260966Z 2025-05-07T20:25:43.0260969Z 2025-05-07T20:25:43.0260973Z 2025-05-07T20:25:43.0260977Z 2025-05-07T20:25:43.0260981Z 2025-05-07T20:25:43.0630949Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 26%  2025-05-07T20:25:43.0672477Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:43.0672803Z 2025-05-07T20:25:43.0672818Z 2025-05-07T20:25:43.0672822Z 2025-05-07T20:25:43.0672826Z 2025-05-07T20:25:43.0672928Z 2025-05-07T20:25:43.0672933Z 2025-05-07T20:25:43.0672936Z 2025-05-07T20:25:43.0672940Z 2025-05-07T20:25:43.0672943Z 2025-05-07T20:25:43.1334749Z libcurand-10.3.7.77 | 39.9 MB | ##8 | 29%  2025-05-07T20:25:43.1335074Z 2025-05-07T20:25:43.1335078Z 2025-05-07T20:25:43.1335082Z 2025-05-07T20:25:43.1335086Z 2025-05-07T20:25:43.1335090Z 2025-05-07T20:25:43.1335094Z 2025-05-07T20:25:43.1335098Z 2025-05-07T20:25:43.1335928Z 2025-05-07T20:25:43.1632355Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 31%  2025-05-07T20:25:43.1674510Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:43.1674863Z 2025-05-07T20:25:43.1674867Z 2025-05-07T20:25:43.1674871Z 2025-05-07T20:25:43.1674874Z 2025-05-07T20:25:43.1674878Z 2025-05-07T20:25:43.1674882Z 2025-05-07T20:25:43.1674888Z 2025-05-07T20:25:43.1674892Z 2025-05-07T20:25:43.1674896Z 2025-05-07T20:25:43.2396211Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 37%  2025-05-07T20:25:43.2396630Z 2025-05-07T20:25:43.2396635Z 2025-05-07T20:25:43.2396640Z 2025-05-07T20:25:43.2396645Z 2025-05-07T20:25:43.2396651Z 2025-05-07T20:25:43.2396656Z 2025-05-07T20:25:43.2396662Z 2025-05-07T20:25:43.2398643Z 2025-05-07T20:25:43.2670140Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###7 | 37%  2025-05-07T20:25:43.2679040Z nsight-compute-2024. | 443.1 MB | ######### | 90% 2025-05-07T20:25:43.2679757Z 2025-05-07T20:25:43.2679763Z 2025-05-07T20:25:43.2679792Z 2025-05-07T20:25:43.2679798Z 2025-05-07T20:25:43.2679803Z 2025-05-07T20:25:43.2679808Z 2025-05-07T20:25:43.2679813Z 2025-05-07T20:25:43.2679819Z 2025-05-07T20:25:43.2679843Z 2025-05-07T20:25:43.3399377Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 45%  2025-05-07T20:25:43.3399842Z 2025-05-07T20:25:43.3399848Z 2025-05-07T20:25:43.3399854Z 2025-05-07T20:25:43.3399859Z 2025-05-07T20:25:43.3399865Z 2025-05-07T20:25:43.3400178Z 2025-05-07T20:25:43.3400187Z 2025-05-07T20:25:43.3401980Z 2025-05-07T20:25:43.3686879Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####3 | 43%  2025-05-07T20:25:43.3691860Z nsight-compute-2024. 
| 443.1 MB | ######### | 91% 2025-05-07T20:25:43.3692118Z 2025-05-07T20:25:43.3692122Z 2025-05-07T20:25:43.3692126Z 2025-05-07T20:25:43.3692134Z 2025-05-07T20:25:43.3692139Z 2025-05-07T20:25:43.3692145Z 2025-05-07T20:25:43.3692150Z 2025-05-07T20:25:43.3692166Z 2025-05-07T20:25:43.3693646Z 2025-05-07T20:25:43.4405782Z libcurand-10.3.7.77 | 39.9 MB | #####2 | 52%  2025-05-07T20:25:43.4406142Z 2025-05-07T20:25:43.4406147Z 2025-05-07T20:25:43.4406151Z 2025-05-07T20:25:43.4406155Z 2025-05-07T20:25:43.4406166Z 2025-05-07T20:25:43.4406170Z 2025-05-07T20:25:43.4406173Z 2025-05-07T20:25:43.4407469Z 2025-05-07T20:25:43.4687880Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##### | 50%  2025-05-07T20:25:43.4826054Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:43.4826404Z 2025-05-07T20:25:43.4826408Z 2025-05-07T20:25:43.4826412Z 2025-05-07T20:25:43.4826416Z 2025-05-07T20:25:43.4826419Z 2025-05-07T20:25:43.4826423Z 2025-05-07T20:25:43.4826435Z 2025-05-07T20:25:43.4826439Z 2025-05-07T20:25:43.4830319Z 2025-05-07T20:25:43.5475490Z libcurand-10.3.7.77 | 39.9 MB | #####9 | 60%  2025-05-07T20:25:43.5475860Z 2025-05-07T20:25:43.5475878Z 2025-05-07T20:25:43.5475884Z 2025-05-07T20:25:43.5475923Z 2025-05-07T20:25:43.5475929Z 2025-05-07T20:25:43.5475934Z 2025-05-07T20:25:43.5475940Z 2025-05-07T20:25:43.5477740Z 2025-05-07T20:25:43.5687727Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####6 | 56%  2025-05-07T20:25:43.5831105Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:43.5831591Z 2025-05-07T20:25:43.5831597Z 2025-05-07T20:25:43.5831612Z 2025-05-07T20:25:43.5831618Z 2025-05-07T20:25:43.5831623Z 2025-05-07T20:25:43.5831653Z 2025-05-07T20:25:43.5831661Z 2025-05-07T20:25:43.5831667Z 2025-05-07T20:25:43.5834918Z 2025-05-07T20:25:43.6475949Z libcurand-10.3.7.77 | 39.9 MB | ######7 | 67%  2025-05-07T20:25:43.6476280Z 2025-05-07T20:25:43.6476284Z 2025-05-07T20:25:43.6476287Z 2025-05-07T20:25:43.6476291Z 2025-05-07T20:25:43.6476295Z 2025-05-07T20:25:43.6476298Z 2025-05-07T20:25:43.6476303Z 2025-05-07T20:25:43.6476309Z 2025-05-07T20:25:43.6688199Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######2 | 63%  2025-05-07T20:25:43.6899802Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:43.6900117Z 2025-05-07T20:25:43.6900121Z 2025-05-07T20:25:43.6900125Z 2025-05-07T20:25:43.6900129Z 2025-05-07T20:25:43.6900132Z 2025-05-07T20:25:43.6900136Z 2025-05-07T20:25:43.6900140Z 2025-05-07T20:25:43.6900144Z 2025-05-07T20:25:43.6902587Z 2025-05-07T20:25:43.7482681Z libcurand-10.3.7.77 | 39.9 MB | #######4 | 75%  2025-05-07T20:25:43.7483199Z 2025-05-07T20:25:43.7483206Z 2025-05-07T20:25:43.7483211Z 2025-05-07T20:25:43.7483216Z 2025-05-07T20:25:43.7483221Z 2025-05-07T20:25:43.7483227Z 2025-05-07T20:25:43.7483235Z 2025-05-07T20:25:43.7483242Z 2025-05-07T20:25:43.7694707Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######9 | 69%  2025-05-07T20:25:43.7902386Z nsight-compute-2024. 
2025-05-07T20:25:43.7902687Z [conda package download progress bars condensed; interleaved carriage-return redraws and terminal control characters removed — final state of each transfer follows]
2025-05-07T20:25:44.2713827Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.9051974Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:25:45.4148850Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:25:45.5029518Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:25:45.9626232Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:25:47.2120017Z libcufft-11.3.0.4    | 156.2 MB | ########## | 100%
2025-05-07T20:25:47.2214085Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:25:47.2634235Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:25:47.4634805Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:25:47.4842668Z python-3.11.8        | 29.3 MB  | ########## | 100%
2025-05-07T20:25:48.1045986Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:25:48.1509098Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:25:48.1732021Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:25:48.2048181Z ... (more hidden) ...
2025-05-07T20:25:48.2738133Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:25:49.4270080Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:25:50.2670689Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:25:50.8855659Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:25:59.2090234Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:59.2160811Z 2025-05-07T20:25:59.2160950Z  2025-05-07T20:25:59.2161095Z 2025-05-07T20:25:59.2161100Z 2025-05-07T20:25:59.2161106Z 2025-05-07T20:25:59.2161244Z  2025-05-07T20:25:59.2161395Z 2025-05-07T20:25:59.2161400Z 2025-05-07T20:25:59.2161405Z 2025-05-07T20:25:59.2161517Z 2025-05-07T20:25:59.2161665Z  2025-05-07T20:25:59.2161828Z 2025-05-07T20:25:59.2161833Z 2025-05-07T20:25:59.2161838Z 2025-05-07T20:25:59.2161843Z 2025-05-07T20:25:59.2161848Z 2025-05-07T20:25:59.2161991Z  2025-05-07T20:25:59.2162152Z 2025-05-07T20:25:59.2162165Z 2025-05-07T20:25:59.2162171Z 2025-05-07T20:25:59.2162176Z 2025-05-07T20:25:59.2162181Z 2025-05-07T20:25:59.2162186Z 2025-05-07T20:25:59.2162333Z  2025-05-07T20:25:59.2162500Z 2025-05-07T20:25:59.2162599Z 2025-05-07T20:25:59.2162604Z 2025-05-07T20:25:59.2162618Z 2025-05-07T20:25:59.2162622Z 2025-05-07T20:25:59.2162627Z 2025-05-07T20:25:59.2162632Z 2025-05-07T20:25:59.2162791Z  2025-05-07T20:25:59.2162978Z 2025-05-07T20:25:59.2162984Z 2025-05-07T20:25:59.2162989Z 2025-05-07T20:25:59.2162994Z 2025-05-07T20:25:59.2163007Z 2025-05-07T20:25:59.2163012Z 2025-05-07T20:25:59.2163017Z 2025-05-07T20:25:59.2163022Z 2025-05-07T20:25:59.2163193Z  2025-05-07T20:25:59.2163392Z 2025-05-07T20:25:59.2163397Z 2025-05-07T20:25:59.2163402Z 2025-05-07T20:25:59.2163415Z 2025-05-07T20:25:59.2163421Z 2025-05-07T20:25:59.2163426Z 2025-05-07T20:25:59.2163431Z 2025-05-07T20:25:59.2163436Z 2025-05-07T20:25:59.2163442Z 2025-05-07T20:25:59.2163605Z  2025-05-07T20:25:59.2163956Z 2025-05-07T20:25:59.2163962Z 2025-05-07T20:25:59.2163973Z 2025-05-07T20:25:59.2163977Z 2025-05-07T20:25:59.2163981Z 2025-05-07T20:25:59.2163984Z 2025-05-07T20:25:59.2163996Z 2025-05-07T20:25:59.2164000Z 2025-05-07T20:25:59.2164003Z 2025-05-07T20:25:59.2164007Z 2025-05-07T20:25:59.2164139Z  2025-05-07T20:25:59.2164308Z 2025-05-07T20:25:59.2164312Z 2025-05-07T20:25:59.2164316Z 2025-05-07T20:25:59.2164320Z 2025-05-07T20:25:59.2164323Z 2025-05-07T20:25:59.2164327Z 2025-05-07T20:25:59.2164331Z 2025-05-07T20:25:59.2164335Z 2025-05-07T20:25:59.2164338Z 2025-05-07T20:25:59.2164342Z 2025-05-07T20:25:59.2164346Z 2025-05-07T20:25:59.2164478Z  2025-05-07T20:25:59.2164653Z 2025-05-07T20:25:59.2164657Z 2025-05-07T20:25:59.2164660Z 2025-05-07T20:25:59.2164664Z 2025-05-07T20:25:59.2164668Z 2025-05-07T20:25:59.2164671Z 2025-05-07T20:25:59.2164675Z 2025-05-07T20:25:59.2164679Z 2025-05-07T20:25:59.2164682Z 2025-05-07T20:25:59.2164686Z 2025-05-07T20:25:59.2164690Z 2025-05-07T20:25:59.2164693Z 2025-05-07T20:25:59.2164829Z  2025-05-07T20:25:59.2165005Z 2025-05-07T20:25:59.2165013Z 2025-05-07T20:25:59.2165017Z 2025-05-07T20:25:59.2165021Z 2025-05-07T20:25:59.2165024Z 2025-05-07T20:25:59.2165028Z 2025-05-07T20:25:59.2165031Z 2025-05-07T20:25:59.2165035Z 2025-05-07T20:25:59.2165038Z 2025-05-07T20:25:59.2165042Z 2025-05-07T20:25:59.2165045Z 2025-05-07T20:25:59.2165049Z 2025-05-07T20:25:59.2165053Z 2025-05-07T20:25:59.2165188Z  2025-05-07T20:25:59.2165368Z 2025-05-07T20:25:59.2165372Z 2025-05-07T20:25:59.2165380Z 2025-05-07T20:25:59.2165383Z 2025-05-07T20:25:59.2165387Z 2025-05-07T20:25:59.2165391Z 2025-05-07T20:25:59.2165394Z 2025-05-07T20:25:59.2165398Z 2025-05-07T20:25:59.2165401Z 2025-05-07T20:25:59.2165410Z 2025-05-07T20:25:59.2165413Z 2025-05-07T20:25:59.2165417Z 2025-05-07T20:25:59.2165421Z 2025-05-07T20:25:59.2165424Z 2025-05-07T20:25:59.2165606Z  2025-05-07T20:25:59.2165793Z 2025-05-07T20:25:59.2165804Z 2025-05-07T20:25:59.2165807Z 2025-05-07T20:25:59.2165814Z 2025-05-07T20:25:59.2165818Z 2025-05-07T20:25:59.2165821Z 
2025-05-07T20:25:59.2165825Z 2025-05-07T20:25:59.2165828Z 2025-05-07T20:25:59.2165832Z 2025-05-07T20:25:59.2165836Z 2025-05-07T20:25:59.2165839Z 2025-05-07T20:25:59.2165843Z 2025-05-07T20:25:59.2165846Z 2025-05-07T20:25:59.2165850Z 2025-05-07T20:25:59.2165853Z 2025-05-07T20:25:59.2165996Z  2025-05-07T20:25:59.2166194Z 2025-05-07T20:25:59.2166198Z 2025-05-07T20:25:59.2166311Z 2025-05-07T20:25:59.2166315Z 2025-05-07T20:25:59.2166319Z 2025-05-07T20:25:59.2166323Z 2025-05-07T20:25:59.2166326Z 2025-05-07T20:25:59.2166330Z 2025-05-07T20:25:59.2166333Z 2025-05-07T20:25:59.2166337Z 2025-05-07T20:25:59.2166340Z 2025-05-07T20:25:59.2166344Z 2025-05-07T20:25:59.2166348Z 2025-05-07T20:25:59.2166351Z 2025-05-07T20:25:59.2166355Z 2025-05-07T20:25:59.2166358Z 2025-05-07T20:25:59.2166512Z  2025-05-07T20:25:59.2166764Z 2025-05-07T20:25:59.2166869Z 2025-05-07T20:25:59.2166875Z 2025-05-07T20:25:59.2166880Z 2025-05-07T20:25:59.2166885Z 2025-05-07T20:25:59.2166890Z 2025-05-07T20:25:59.2166895Z 2025-05-07T20:25:59.2166901Z 2025-05-07T20:25:59.2166906Z 2025-05-07T20:25:59.2166920Z 2025-05-07T20:25:59.2166925Z 2025-05-07T20:25:59.2166930Z 2025-05-07T20:25:59.2166935Z 2025-05-07T20:25:59.2166940Z 2025-05-07T20:25:59.2166945Z 2025-05-07T20:25:59.2166950Z 2025-05-07T20:25:59.2166956Z 2025-05-07T20:25:59.2167206Z  2025-05-07T20:25:59.2167494Z 2025-05-07T20:25:59.2167500Z 2025-05-07T20:25:59.2167504Z 2025-05-07T20:25:59.2167510Z 2025-05-07T20:25:59.2167514Z 2025-05-07T20:25:59.2167520Z 2025-05-07T20:25:59.2167525Z 2025-05-07T20:25:59.2167530Z 2025-05-07T20:25:59.2167535Z 2025-05-07T20:25:59.2167540Z 2025-05-07T20:25:59.2167545Z 2025-05-07T20:25:59.2167550Z 2025-05-07T20:25:59.2167555Z 2025-05-07T20:25:59.2167560Z 2025-05-07T20:25:59.2167565Z 2025-05-07T20:25:59.2167570Z 2025-05-07T20:25:59.2167583Z 2025-05-07T20:25:59.2167588Z 2025-05-07T20:25:59.2167814Z  2025-05-07T20:25:59.2168083Z 2025-05-07T20:25:59.2168086Z 2025-05-07T20:25:59.2168189Z  2025-05-07T20:25:59.2168297Z 2025-05-07T20:25:59.2168301Z 2025-05-07T20:25:59.2168402Z  2025-05-07T20:25:59.2168514Z 2025-05-07T20:25:59.2168518Z 2025-05-07T20:25:59.2168522Z 2025-05-07T20:25:59.2168621Z  2025-05-07T20:25:59.2168725Z 2025-05-07T20:25:59.2168735Z 2025-05-07T20:25:59.2168739Z 2025-05-07T20:25:59.2168743Z 2025-05-07T20:25:59.2168855Z  2025-05-07T20:25:59.2168968Z 2025-05-07T20:25:59.2168972Z 2025-05-07T20:25:59.2168976Z 2025-05-07T20:25:59.2168979Z 2025-05-07T20:25:59.2168983Z 2025-05-07T20:25:59.2169098Z  2025-05-07T20:25:59.2169218Z 2025-05-07T20:25:59.2169221Z 2025-05-07T20:25:59.2169225Z 2025-05-07T20:25:59.2169228Z 2025-05-07T20:25:59.2169232Z 2025-05-07T20:25:59.2169236Z 2025-05-07T20:25:59.2169354Z  2025-05-07T20:25:59.2169484Z 2025-05-07T20:25:59.2169487Z 2025-05-07T20:25:59.2169491Z 2025-05-07T20:25:59.2169494Z 2025-05-07T20:25:59.2169498Z 2025-05-07T20:25:59.2169502Z 2025-05-07T20:25:59.2169505Z 2025-05-07T20:25:59.2169623Z  2025-05-07T20:25:59.2169760Z 2025-05-07T20:25:59.2169764Z 2025-05-07T20:25:59.2169768Z 2025-05-07T20:25:59.2169771Z 2025-05-07T20:25:59.2169775Z 2025-05-07T20:25:59.2169778Z 2025-05-07T20:25:59.2169782Z 2025-05-07T20:25:59.2169790Z 2025-05-07T20:25:59.2169915Z  2025-05-07T20:25:59.2170058Z 2025-05-07T20:25:59.2170062Z 2025-05-07T20:25:59.2170065Z 2025-05-07T20:25:59.2170069Z 2025-05-07T20:25:59.2170072Z 2025-05-07T20:25:59.2170076Z 2025-05-07T20:25:59.2170079Z 2025-05-07T20:25:59.2170083Z 2025-05-07T20:25:59.2170087Z 2025-05-07T20:25:59.2170210Z  2025-05-07T20:25:59.2170363Z 2025-05-07T20:25:59.2170366Z 2025-05-07T20:25:59.2170370Z 
2025-05-07T20:25:59.2170374Z 2025-05-07T20:25:59.2170381Z 2025-05-07T20:25:59.2170385Z 2025-05-07T20:25:59.2170388Z 2025-05-07T20:25:59.2170392Z 2025-05-07T20:25:59.2170395Z 2025-05-07T20:25:59.2170399Z 2025-05-07T20:25:59.2170529Z  2025-05-07T20:25:59.2170687Z 2025-05-07T20:25:59.2170691Z 2025-05-07T20:25:59.2170694Z 2025-05-07T20:25:59.2170698Z 2025-05-07T20:25:59.2170701Z 2025-05-07T20:25:59.2170705Z 2025-05-07T20:25:59.2170709Z 2025-05-07T20:25:59.2170712Z 2025-05-07T20:25:59.2170833Z 2025-05-07T20:25:59.2170846Z 2025-05-07T20:25:59.2170850Z 2025-05-07T20:25:59.2170978Z  2025-05-07T20:25:59.2171148Z 2025-05-07T20:25:59.2171151Z 2025-05-07T20:25:59.2171155Z 2025-05-07T20:25:59.2171158Z 2025-05-07T20:25:59.2171162Z 2025-05-07T20:25:59.2171166Z 2025-05-07T20:25:59.2171175Z 2025-05-07T20:25:59.2171179Z 2025-05-07T20:25:59.2171182Z 2025-05-07T20:25:59.2171186Z 2025-05-07T20:25:59.2171189Z 2025-05-07T20:25:59.2171193Z 2025-05-07T20:25:59.2171323Z  2025-05-07T20:25:59.2171576Z 2025-05-07T20:25:59.2171585Z 2025-05-07T20:25:59.2171589Z 2025-05-07T20:25:59.2171592Z 2025-05-07T20:25:59.2171596Z 2025-05-07T20:25:59.2171599Z 2025-05-07T20:25:59.2171603Z 2025-05-07T20:25:59.2171607Z 2025-05-07T20:25:59.2171610Z 2025-05-07T20:25:59.2171614Z 2025-05-07T20:25:59.2171617Z 2025-05-07T20:25:59.2171621Z 2025-05-07T20:25:59.2171624Z 2025-05-07T20:25:59.2171758Z  2025-05-07T20:25:59.2171952Z 2025-05-07T20:25:59.2171956Z 2025-05-07T20:25:59.2171960Z 2025-05-07T20:25:59.2171964Z 2025-05-07T20:25:59.2171967Z 2025-05-07T20:25:59.2171971Z 2025-05-07T20:25:59.2171974Z 2025-05-07T20:25:59.2171978Z 2025-05-07T20:25:59.2171981Z 2025-05-07T20:25:59.2171985Z 2025-05-07T20:25:59.2171988Z 2025-05-07T20:25:59.2171992Z 2025-05-07T20:25:59.2171996Z 2025-05-07T20:25:59.2171999Z 2025-05-07T20:25:59.2172145Z  2025-05-07T20:25:59.2172334Z 2025-05-07T20:25:59.2172342Z 2025-05-07T20:25:59.2172346Z 2025-05-07T20:25:59.2172349Z 2025-05-07T20:25:59.2172353Z 2025-05-07T20:25:59.2172357Z 2025-05-07T20:25:59.2172360Z 2025-05-07T20:25:59.2172364Z 2025-05-07T20:25:59.2172367Z 2025-05-07T20:25:59.2172371Z 2025-05-07T20:25:59.2172375Z 2025-05-07T20:25:59.2172378Z 2025-05-07T20:25:59.2172382Z 2025-05-07T20:25:59.2172392Z 2025-05-07T20:25:59.2172396Z 2025-05-07T20:25:59.2172538Z  2025-05-07T20:25:59.2172735Z 2025-05-07T20:25:59.2172739Z 2025-05-07T20:25:59.2172742Z 2025-05-07T20:25:59.2172746Z 2025-05-07T20:25:59.2172749Z 2025-05-07T20:25:59.2172760Z 2025-05-07T20:25:59.2172764Z 2025-05-07T20:25:59.2172768Z 2025-05-07T20:25:59.2172771Z 2025-05-07T20:25:59.2172775Z 2025-05-07T20:25:59.2172779Z 2025-05-07T20:25:59.2172783Z 2025-05-07T20:25:59.2172787Z 2025-05-07T20:25:59.2172790Z 2025-05-07T20:25:59.2172794Z 2025-05-07T20:25:59.2172798Z 2025-05-07T20:25:59.2172945Z  2025-05-07T20:25:59.2173152Z 2025-05-07T20:25:59.2173156Z 2025-05-07T20:25:59.2173160Z 2025-05-07T20:25:59.2173163Z 2025-05-07T20:25:59.2173167Z 2025-05-07T20:25:59.2173171Z 2025-05-07T20:25:59.2173175Z 2025-05-07T20:25:59.2173178Z 2025-05-07T20:25:59.2173182Z 2025-05-07T20:25:59.2173185Z 2025-05-07T20:25:59.2173189Z 2025-05-07T20:25:59.2173193Z 2025-05-07T20:25:59.2173196Z 2025-05-07T20:25:59.2173200Z 2025-05-07T20:25:59.2173203Z 2025-05-07T20:25:59.2173207Z 2025-05-07T20:25:59.2173215Z 2025-05-07T20:25:59.2173369Z  2025-05-07T20:25:59.2173571Z 2025-05-07T20:25:59.2173575Z 2025-05-07T20:25:59.2173578Z 2025-05-07T20:25:59.2173582Z 2025-05-07T20:25:59.2173586Z 2025-05-07T20:25:59.2173589Z 2025-05-07T20:25:59.2173593Z 2025-05-07T20:25:59.2173596Z 2025-05-07T20:25:59.2173600Z 
2025-05-07T20:25:59.2173604Z 2025-05-07T20:25:59.2173616Z 2025-05-07T20:25:59.2173619Z 2025-05-07T20:25:59.2173623Z 2025-05-07T20:25:59.2173627Z 2025-05-07T20:25:59.2173634Z 2025-05-07T20:25:59.2173638Z 2025-05-07T20:25:59.2173642Z 2025-05-07T20:25:59.2173646Z 2025-05-07T20:25:59.2173810Z  2025-05-07T20:25:59.2174024Z 2025-05-07T20:25:59.2174027Z 2025-05-07T20:25:59.2174122Z  2025-05-07T20:25:59.2174221Z 2025-05-07T20:25:59.2174225Z 2025-05-07T20:25:59.2174330Z  2025-05-07T20:25:59.2174434Z 2025-05-07T20:25:59.2174438Z 2025-05-07T20:25:59.2174441Z 2025-05-07T20:25:59.2174624Z  2025-05-07T20:25:59.2174738Z 2025-05-07T20:25:59.2174741Z 2025-05-07T20:25:59.2174745Z 2025-05-07T20:25:59.2174749Z 2025-05-07T20:25:59.2174854Z  2025-05-07T20:25:59.2174970Z 2025-05-07T20:25:59.2174973Z 2025-05-07T20:25:59.2174977Z 2025-05-07T20:25:59.2174981Z 2025-05-07T20:25:59.2174985Z 2025-05-07T20:25:59.2175092Z  2025-05-07T20:25:59.2175214Z 2025-05-07T20:25:59.2175218Z 2025-05-07T20:25:59.2175222Z 2025-05-07T20:25:59.2175225Z 2025-05-07T20:25:59.2175307Z 2025-05-07T20:25:59.2175311Z 2025-05-07T20:25:59.2175422Z  2025-05-07T20:25:59.2175552Z 2025-05-07T20:25:59.2175556Z 2025-05-07T20:25:59.2175559Z 2025-05-07T20:25:59.2175563Z 2025-05-07T20:25:59.2175566Z 2025-05-07T20:25:59.2175570Z 2025-05-07T20:25:59.2175574Z 2025-05-07T20:25:59.2175685Z  2025-05-07T20:25:59.2175818Z 2025-05-07T20:25:59.2175827Z 2025-05-07T20:25:59.2175831Z 2025-05-07T20:25:59.2175835Z 2025-05-07T20:25:59.2175838Z 2025-05-07T20:25:59.2175847Z 2025-05-07T20:25:59.2175851Z 2025-05-07T20:25:59.2175855Z 2025-05-07T20:25:59.2175971Z  2025-05-07T20:25:59.2176114Z 2025-05-07T20:25:59.2176124Z 2025-05-07T20:25:59.2176128Z 2025-05-07T20:25:59.2176132Z 2025-05-07T20:25:59.2176135Z 2025-05-07T20:25:59.2176139Z 2025-05-07T20:25:59.2176142Z 2025-05-07T20:25:59.2176146Z 2025-05-07T20:25:59.2176149Z 2025-05-07T20:25:59.2176269Z  2025-05-07T20:25:59.2176427Z 2025-05-07T20:25:59.2176431Z 2025-05-07T20:25:59.2176440Z 2025-05-07T20:25:59.2176444Z 2025-05-07T20:25:59.2176448Z 2025-05-07T20:25:59.2176452Z 2025-05-07T20:25:59.2176456Z 2025-05-07T20:25:59.2176459Z 2025-05-07T20:25:59.2176463Z 2025-05-07T20:25:59.2176467Z 2025-05-07T20:25:59.2176594Z  2025-05-07T20:25:59.2176760Z 2025-05-07T20:25:59.2176764Z 2025-05-07T20:25:59.2176768Z 2025-05-07T20:25:59.2176771Z 2025-05-07T20:25:59.2176775Z 2025-05-07T20:25:59.2176779Z 2025-05-07T20:25:59.2176787Z 2025-05-07T20:25:59.2176791Z 2025-05-07T20:25:59.2176794Z 2025-05-07T20:25:59.2176798Z 2025-05-07T20:25:59.2176802Z 2025-05-07T20:25:59.2176939Z  done 2025-05-07T20:25:59.5375294Z Preparing transaction: \ | / done 2025-05-07T20:26:01.1950806Z Verifying transaction: \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:26:02.1198665Z Executing transaction: | / - \ | / - \ | done 2025-05-07T20:26:04.5144536Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:04.5145119Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:04.5145824Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:04.5146385Z 2025-05-07T20:26:04.5157042Z 2025-05-07T20:26:04.5158090Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:04.5158853Z 2025-05-07T20:26:04.5170370Z 2025-05-07T20:26:04.5170562Z [INSTALL] Copying nvtx3 headers ... 
2025-05-07T20:26:04.5176584Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:04.6785414Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:04.6808352Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:04.7177246Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:06.6084933Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:06.6752847Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:07.1055177Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:07.1402038Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:07.5816587Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:07.5817767Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:10.0723246Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:12.1107490Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:14.1649524Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:14.1650444Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:16.2153688Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:18.1379241Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:18.2032668Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:22.0978126Z /tmp/tmpdjpi5s50: line 3: clang: command not found
2025-05-07T20:26:22.0978702Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:22.1662481Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:22.1685129Z total 36
2025-05-07T20:26:22.1685627Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:22.1686032Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:22.1686917Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:22.1687691Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:22.1688405Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:22.1689125Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:22.1689780Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:22.1690414Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:22.1691337Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:22.1692233Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:22.1711266Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:24.1567407Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:24.1567996Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:24.5866205Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:26.4978698Z -allow-unsupported-compiler
2025-05-07T20:26:26.5632169Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:26.5633277Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:26.5633926Z 2025-05-07T20:26:28.5449305Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:28.5450087Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:28.5450458Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:28.5450795Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:28.5451122Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5451414Z #define _STL_PAIR_H 1 2025-05-07T20:26:28.5451766Z #define __cpp_attributes 200809L 2025-05-07T20:26:28.5452214Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:28.5452689Z #define __DELETE_THROW throw() 2025-05-07T20:26:28.5453003Z #define _PTRDIFF_T_ 2025-05-07T20:26:28.5453329Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:28.5453767Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:28.5454168Z #define _IO_LEFT 02 2025-05-07T20:26:28.5454478Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:28.5454837Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:28.5455123Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:28.5455548Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:28.5455974Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5456391Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:28.5456757Z #define _IOS_OUTPUT 2 2025-05-07T20:26:28.5457183Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:28.5457698Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:28.5458141Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:28.5458533Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:28.5458925Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:28.5460013Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:28.5461130Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:28.5461505Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:28.5461992Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:28.5462443Z #define _T_WCHAR_ 2025-05-07T20:26:28.5462738Z #define stdout stdout 2025-05-07T20:26:28.5463195Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:28.5463913Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:28.5464164Z #define __flexarr [] 2025-05-07T20:26:28.5464402Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:28.5464723Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:28.5465057Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:28.5465311Z #define _MATH_H 1 2025-05-07T20:26:28.5465659Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:28.5465996Z #define __S64_TYPE long int 2025-05-07T20:26:28.5466245Z #define __stub_fchflags 2025-05-07T20:26:28.5466682Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:28.5466969Z #define __SQUAD_TYPE long int 2025-05-07T20:26:28.5467229Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:28.5467490Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:28.5467741Z #define NL_NMAX INT_MAX 2025-05-07T20:26:28.5467969Z #define _BITS_TIME_H 1 2025-05-07T20:26:28.5468242Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:28.5468569Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:28.5468865Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:28.5469220Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:28.5469612Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:28.5469967Z #define __CHAR_BIT__ 8 2025-05-07T20:26:28.5471656Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5472023Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:28.5472314Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:28.5472584Z #define FP_NAN 0 2025-05-07T20:26:28.5472841Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:28.5473276Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:28.5473753Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:28.5474136Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:28.5474417Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:28.5474675Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:28.5474929Z #define __SM_80_RT_H__ 2025-05-07T20:26:28.5475152Z #define _NEW 2025-05-07T20:26:28.5475366Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:28.5475644Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:28.5476005Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:28.5476396Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:28.5476631Z #define __USE_ANSI 1 2025-05-07T20:26:28.5476911Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:28.5477298Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:28.5477650Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:28.5477954Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:28.5478227Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:28.5478501Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:28.5478777Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:28.5479057Z #define PIPE_BUF 4096 2025-05-07T20:26:28.5479376Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:28.5481116Z #define ADJ_TICK 0x4000 2025-05-07T20:26:28.5481391Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:28.5481703Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:28.5481964Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:28.5482280Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:28.5482737Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:28.5483255Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:28.5483812Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:28.5484071Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:28.5484336Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5484621Z #define __cpp_static_assert 201411L 2025-05-07T20:26:28.5484961Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:28.5485294Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:28.5485670Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:28.5485952Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:28.5486245Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:28.5486526Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:28.5486825Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5487170Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:28.5487509Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:28.5487786Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:28.5488094Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5488522Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:28.5488871Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:28.5489166Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:28.5489447Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:28.5489768Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:28.5490090Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:28.5490485Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:28.5490885Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:28.5491183Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:28.5491446Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:28.5491714Z #define __GCC_IEC_559 2 2025-05-07T20:26:28.5491999Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:28.5492331Z #define _IO_flockfile(_fp) 2025-05-07T20:26:28.5492583Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:28.5492852Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5493111Z #define _IOFBF 0 2025-05-07T20:26:28.5493316Z #define __USE_BSD 1 2025-05-07T20:26:28.5493538Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:28.5493802Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:28.5494065Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:28.5494312Z #define _IO_NO_WRITES 8 2025-05-07T20:26:28.5494562Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:28.5494910Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:28.5495255Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:28.5495554Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:28.5495872Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:28.5496156Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:28.5496420Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:28.5496685Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5496986Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:28.5497376Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:28.5497734Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:28.5498031Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:28.5498336Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:28.5498885Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:28.5499185Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:28.5499494Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:28.5499767Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:28.5500033Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:28.5500604Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:28.5501194Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:28.5501518Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:28.5501833Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:28.5502143Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:28.5502423Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:28.5502680Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:28.5502984Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:28.5503309Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:28.5503608Z #define RAND_MAX 2147483647 2025-05-07T20:26:28.5503869Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:28.5504289Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5504601Z #define __SM_90_RT_H__ 2025-05-07T20:26:28.5504836Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:28.5505093Z #define __COMPAR_FN_T 2025-05-07T20:26:28.5505331Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5505617Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:28.5506113Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:28.5506642Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5507068Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5507416Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:28.5507713Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:28.5508054Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:28.5508358Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:28.5508866Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:28.5509417Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:28.5509753Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:28.5510017Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:28.5522535Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:28.5522835Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:28.5523087Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:28.5523342Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:28.5523587Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:28.5524015Z #define __u_char_defined 2025-05-07T20:26:28.5524330Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:28.5524672Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:28.5524909Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:28.5525154Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:28.5525437Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:28.5525876Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:28.5526298Z #define FP_INFINITE 1 2025-05-07T20:26:28.5526670Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5527086Z #define _IO_pid_t __pid_t 2025-05-07T20:26:28.5527345Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:28.5527603Z #define __LEAF , __leaf__ 2025-05-07T20:26:28.5527843Z #define PATH_MAX 4096 2025-05-07T20:26:28.5528079Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:28.5528412Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:28.5528734Z #define _LIMITS_H___ 2025-05-07T20:26:28.5528961Z #define __size_t 2025-05-07T20:26:28.5529182Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:28.5529720Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:28.5530284Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:28.5530581Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:28.5530913Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:28.5531176Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:28.5531524Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:28.5531931Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:28.5532230Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:28.5532550Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:28.5532835Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:28.5533108Z #define __INT8_C(c) c 2025-05-07T20:26:28.5533372Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:28.5533672Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:28.5533925Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:28.5534181Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:28.5534431Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:28.5534702Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:28.5535017Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5535338Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:28.5535601Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:28.5536069Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:28.5536362Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:28.5536673Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:28.5536972Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:28.5537326Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:28.5537699Z #define NFDBITS __NFDBITS 2025-05-07T20:26:28.5537949Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:28.5538232Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:28.5539106Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:28.5539426Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:28.5539673Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:28.5539957Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:28.5540255Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:28.5540554Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:28.5540971Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:28.5541325Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5541601Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:28.5541910Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:28.5542277Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:28.5542602Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:28.5542913Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:28.5543240Z #define __daddr_t_defined 2025-05-07T20:26:28.5543493Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:28.5543752Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:28.5544057Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:28.5544557Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:28.5545023Z #define _ACRTIMP 2025-05-07T20:26:28.5545243Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:28.5545501Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:28.5545787Z #define _IOS_BIN 128 2025-05-07T20:26:28.5546147Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:28.5546592Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:28.5546847Z #define UNDERFLOW 4 2025-05-07T20:26:28.5547062Z #define NAME_MAX 255 2025-05-07T20:26:28.5547291Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:28.5547562Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:28.5547827Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:28.5548119Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:28.5548497Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:28.5548873Z #define __ptr_t void * 2025-05-07T20:26:28.5549112Z #define M_E 2.7182818284590452354 2025-05-07T20:26:28.5549386Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:28.5549642Z #define __USE_ISOCXX11 1 2025-05-07T20:26:28.5549906Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:28.5550227Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:28.5550510Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:28.5550781Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:28.5551062Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:28.5551362Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:28.5551619Z #define __linux 1 2025-05-07T20:26:28.5551842Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:28.5552112Z #define cudaDeviceMask 0xff 2025-05-07T20:26:28.5552374Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:28.5552666Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:28.5552940Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:28.5553217Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:28.5553522Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:28.5553821Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:28.5554102Z #define _BITS_TYPES_H 1 2025-05-07T20:26:28.5554387Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:28.5554976Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:28.5555273Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:28.5555547Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:28.5555829Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:28.5556107Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:28.5556878Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:28.5557685Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:28.5558105Z #define __unix 1 2025-05-07T20:26:28.5558309Z #define MATH_ERRNO 1 2025-05-07T20:26:28.5558546Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:28.5558823Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:28.5559080Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:28.5559357Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:28.5559637Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5559913Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:28.5560365Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:28.5560821Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:28.5561112Z #define CUDARTAPI_CDECL 2025-05-07T20:26:28.5561358Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:28.5561625Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:28.5561904Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:28.5562156Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:28.5562401Z #define __SIZE_T 2025-05-07T20:26:28.5562646Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:28.5562955Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:28.5563244Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:28.5563500Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:28.5563898Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:28.5564281Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:28.5564713Z #define __WAIT_STATUS void * 2025-05-07T20:26:28.5564974Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:28.5565233Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:28.5565493Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5565776Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:28.5566043Z #define __WINT_MIN__ 0U 2025-05-07T20:26:28.5566602Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:28.5567240Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:28.5567528Z #define WUNTRACED 2 2025-05-07T20:26:28.5567754Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:28.5568025Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:28.5568295Z #define NZERO 20 2025-05-07T20:26:28.5568520Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:28.5568795Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:28.5569083Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:28.5569367Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:28.5569616Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5569901Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:28.5570163Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:28.5570435Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:28.5570707Z #define EXIT_FAILURE 1 2025-05-07T20:26:28.5570933Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:28.5571188Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:28.5571451Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:28.5571698Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:28.5571979Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:28.5572309Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:28.5572656Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:28.5572946Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:28.5573192Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:28.5573461Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:28.5573842Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:28.5574146Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:28.5574431Z #define SEEK_DATA 3 2025-05-07T20:26:28.5574651Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:28.5574939Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:28.5575353Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:28.5575731Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:28.5575976Z #define __INT64_C(c) c ## L 2025-05-07T20:26:28.5576241Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:28.5576644Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:28.5576966Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:28.5577237Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:28.5577524Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:28.5577819Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:28.5578077Z #define __INT_WCHAR_T_H 2025-05-07T20:26:28.5578311Z #define WSTOPPED 2 2025-05-07T20:26:28.5578542Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:28.5578832Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:28.5579081Z #define FP_NORMAL 4 
2025-05-07T20:26:28.5579312Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:28.5579594Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:28.5579830Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:28.5580075Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:28.5580359Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:28.5580628Z #define cudaTextureType1D 0x01 2025-05-07T20:26:28.5580917Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:28.5581179Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:28.5581436Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:28.5581730Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:28.5582161Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:28.5582599Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:28.5582860Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:28.5583126Z #define _POSIX_SOURCE 1 2025-05-07T20:26:28.5583373Z #define cudaTextureType2D 0x02 2025-05-07T20:26:28.5583633Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:28.5583899Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:28.5584203Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:28.5584463Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:28.5584780Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:28.5585114Z #define cudaTextureType3D 0x03 2025-05-07T20:26:28.5585375Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:28.5585640Z #define CLOCK_REALTIME 0 2025-05-07T20:26:28.5585889Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:28.5586155Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:28.5586455Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:28.5586728Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:28.5586994Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:28.5587274Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:28.5587545Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:28.5587836Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:28.5588128Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:28.5588404Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:28.5588642Z #define __GLIBC__ 2 2025-05-07T20:26:28.5588856Z #define __END_DECLS } 2025-05-07T20:26:28.5589090Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:28.5589450Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:28.5589815Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:28.5590067Z #define WCONTINUED 8 2025-05-07T20:26:28.5590296Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:28.5590540Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:28.5590809Z #define _ALLOCA_H 1 2025-05-07T20:26:28.5591036Z #define __host__ __location__(host) 2025-05-07T20:26:28.5591448Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:28.5591883Z #define __SLONG32_TYPE int 2025-05-07T20:26:28.5592241Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:28.5592514Z #define _SYS_SELECT_H 1 2025-05-07T20:26:28.5592752Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:28.5592996Z #define _IOS_NOCREATE 32 2025-05-07T20:26:28.5593234Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:28.5593512Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:28.5593803Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:28.5594085Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:28.5594364Z #define __global__ __location__(global) 2025-05-07T20:26:28.5594747Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:28.5595003Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:28.5595269Z #define __DBL_DIG__ 15 2025-05-07T20:26:28.5595494Z #define TIME_UTC 1 2025-05-07T20:26:28.5595710Z #define __FLT32_DIG__ 6 2025-05-07T20:26:28.5596031Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:28.5596467Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:28.5596780Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:28.5597087Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:28.5597382Z #define _G_BUFSIZ 8192 2025-05-07T20:26:28.5597680Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:28.5598045Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:28.5598335Z #define __cudaCDP2GetDevice 2025-05-07T20:26:28.5598614Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:28.5598904Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:28.5599141Z #define __GXX_WEAK__ 1 2025-05-07T20:26:28.5599401Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5599707Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:28.5599956Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:28.5600250Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:28.5600586Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:28.5600854Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:28.5601140Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:28.5601436Z #define _G_config_h 1 2025-05-07T20:26:28.5601705Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:28.5602037Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:28.5602310Z #define _GCC_WCHAR_T 2025-05-07T20:26:28.5602536Z #define TMP_MAX 238328 2025-05-07T20:26:28.5602764Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:28.5603028Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:28.5603287Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5603554Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:28.5603977Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:28.5604260Z #define _IO_SKIPWS 01 2025-05-07T20:26:28.5604655Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:28.5605115Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:28.5605376Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:28.5605699Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:28.5606061Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:28.5606437Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:28.5606797Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:28.5607039Z #define le32toh(x) (x) 2025-05-07T20:26:28.5607272Z #define _SIZE_T_DEFINED 2025-05-07T20:26:28.5607524Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:28.5607854Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:28.5608200Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:28.5608590Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:28.5609002Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:28.5609266Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:28.5609525Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:28.5609781Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:28.5610060Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:28.5610592Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:28.5611217Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:28.5611521Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:28.5611869Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:28.5612181Z #define _WCHAR_T_ 2025-05-07T20:26:28.5612401Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:28.5612760Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:28.5613142Z #define RTSIG_MAX 32 2025-05-07T20:26:28.5613355Z #define _STDDEF_H 2025-05-07T20:26:28.5613668Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:28.5613940Z #define _VA_LIST_DEFINED 2025-05-07T20:26:28.5614180Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:28.5614509Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:28.5614893Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:28.5615209Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:28.5615493Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:28.5615954Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:28.5616470Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:28.5616825Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:28.5617139Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:28.5617449Z #define __unix__ 1 2025-05-07T20:26:28.5617669Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5617944Z #define __INT_WIDTH__ 32 2025-05-07T20:26:28.5618182Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:28.5618416Z #define _IONBF 2 2025-05-07T20:26:28.5618856Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:28.5619609Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:28.5620135Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:28.5620383Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:28.5620646Z #define __UINT16_C(c) c 2025-05-07T20:26:28.5620884Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:28.5621142Z #define STA_DEL 0x0020 2025-05-07T20:26:28.5621378Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:28.5621628Z #define __id_t_defined 2025-05-07T20:26:28.5621886Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:28.5622330Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:28.5622751Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:28.5623019Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:28.5623270Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:28.5623521Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:28.5623777Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:28.5624035Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:28.5624297Z #define SING 2 2025-05-07T20:26:28.5624514Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:28.5624770Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5625077Z #define cudaStreamDefault 0x00 2025-05-07T20:26:28.5625425Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:28.5625787Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:28.5626059Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:28.5626324Z #define __gnu_linux__ 1 2025-05-07T20:26:28.5626555Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:28.5626810Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:28.5627058Z #define MAX_INPUT 255 2025-05-07T20:26:28.5627288Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5627614Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:28.5627984Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:28.5628297Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:28.5628608Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:28.5629007Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:28.5629430Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:28.5629863Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:28.5630222Z #define _Mfloat_ float 2025-05-07T20:26:28.5630481Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:28.5630779Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5631063Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:28.5631543Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:28.5632025Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5632372Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:28.5632693Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:28.5633043Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:28.5633326Z #define __USE_ISOC11 1 2025-05-07T20:26:28.5633553Z #define _BSD_SIZE_T_ 2025-05-07T20:26:28.5633782Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:28.5634023Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:28.5634287Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:28.5634578Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:28.5634887Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:28.5635193Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:28.5635519Z #define __THROW throw () 2025-05-07T20:26:28.5635761Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:28.5636040Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5636388Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5636732Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:28.5637000Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:28.5637255Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:28.5637515Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:28.5637766Z #define L_tmpnam 20 2025-05-07T20:26:28.5637984Z #define ___int_wchar_t_h 2025-05-07T20:26:28.5638319Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:28.5639082Z #define isascii(c) __isascii (c) 2025-05-07T20:26:28.5647403Z #define _T_PTRDIFF 2025-05-07T20:26:28.5647726Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:28.5648075Z #define toascii(c) __toascii (c) 2025-05-07T20:26:28.5648329Z #define __GNUC__ 11 2025-05-07T20:26:28.5648586Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:28.5648883Z #define __GXX_RTTI 1 2025-05-07T20:26:28.5649099Z #define __pie__ 2 2025-05-07T20:26:28.5649310Z #define __MMX__ 1 2025-05-07T20:26:28.5649523Z #define __cudaCDP2Malloc 2025-05-07T20:26:28.5649788Z #define __timespec_defined 1 2025-05-07T20:26:28.5650036Z #define L_ctermid 9 2025-05-07T20:26:28.5650256Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:28.5650564Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:28.5650952Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:28.5651314Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:28.5651577Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:28.5651868Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:28.5652168Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:28.5652478Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:28.5652740Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:28.5653172Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:28.5653901Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:28.5654500Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:28.5654803Z #define __USE_SVID 1 2025-05-07T20:26:28.5655047Z #define __constant__ __location__(constant) 2025-05-07T20:26:28.5655356Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:28.5655649Z #define __device__ __location__(device) 2025-05-07T20:26:28.5655975Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:28.5656289Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:28.5656876Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:28.5657155Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:28.5657499Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:28.5657859Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:28.5658128Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:28.5658490Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:28.5658864Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:28.5659101Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:28.5659459Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:28.5660077Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:28.5660380Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:28.5660648Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:28.5660914Z #define NGROUPS_MAX 65536 2025-05-07T20:26:28.5661163Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:28.5661425Z #define __USE_ISOC95 1 2025-05-07T20:26:28.5661651Z #define _TIME_H 1 2025-05-07T20:26:28.5661920Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:28.5662243Z #define __USE_ISOC99 1 2025-05-07T20:26:28.5662565Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:28.5662931Z #define HOST_NAME_MAX 64 2025-05-07T20:26:28.5663173Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:28.5663432Z #define _IOS_ATEND 4 2025-05-07T20:26:28.5663667Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:28.5663985Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:28.5664396Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:28.5664749Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:28.5665029Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:28.5665358Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:28.5665674Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:28.5665924Z #define _STDIO_H 1 2025-05-07T20:26:28.5666374Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:28.5666844Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:28.5667203Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:28.5667568Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:28.5667854Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:28.5668121Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:28.5668377Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:28.5668667Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:28.5668973Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5669276Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:28.5669545Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:28.5669819Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:28.5670117Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:28.5670387Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:28.5670667Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:28.5671022Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:28.5671378Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:28.5671618Z #define __USE_XOPEN 1 2025-05-07T20:26:28.5671858Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:28.5672287Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:28.5672725Z #define __USE_XOPEN2K 1 2025-05-07T20:26:28.5672963Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:28.5673223Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:28.5673513Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:28.5673783Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:28.5674289Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:28.5674812Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:28.5675090Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:28.5675443Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:28.5675938Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5676364Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:28.5676758Z #define __END_NAMESPACE_C99 2025-05-07T20:26:28.5677019Z #define __glibcxx_integral_traps true 2025-05-07T20:26:28.5677306Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:28.5677558Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:28.5677805Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:28.5678072Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:28.5678322Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:28.5678694Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:28.5678993Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:28.5679362Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:28.5679740Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:28.5680009Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:28.5680270Z #define _IO_UNITBUF 020000 2025-05-07T20:26:28.5680522Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:28.5680779Z #define __FD_SETSIZE 1024 2025-05-07T20:26:28.5681025Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:28.5681295Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:28.5681632Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:28.5681985Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:28.5682250Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:28.5682549Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:28.5682868Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:28.5683147Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:28.5683447Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:28.5683932Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:28.5684213Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:28.5684535Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:28.5684810Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:28.5685081Z #define __USE_POSIX199506 1 2025-05-07T20:26:28.5685323Z #define _FEATURES_H 1 2025-05-07T20:26:28.5685556Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:28.5685940Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:28.5686347Z #define __stub_getmsg 2025-05-07T20:26:28.5686569Z #define _IO_FIXED 010000 2025-05-07T20:26:28.5686834Z #define __cpp_lib_addressof_constexpr 201603 
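The _ISbit helper dumped just above is how glibc's <ctype.h> packs each character class into one bit of the 16-bit classification word that __isctype_l (seen earlier in this dump) reads out of the __ctype_b table: classes 0-7 land in the high byte, classes 8-15 wrap into the low byte. A minimal sketch reusing the macro exactly as dumped; the loop bound of 12 is arbitrary, chosen only for illustration:

#include <cstdio>

// _ISbit copied verbatim from the dump: class numbers below 8 shift into the
// high byte, class numbers 8 and above wrap around into the low byte.
#define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))

int main() {
    for (int bit = 0; bit < 12; ++bit)
        std::printf("_ISbit(%2d) = 0x%04x\n", bit, _ISbit(bit));
}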
2025-05-07T20:26:28.5687140Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:28.5687399Z #define __stub_setlogin 2025-05-07T20:26:28.5687638Z #define __stub_fattach 2025-05-07T20:26:28.5687878Z #define __cplusplus 201703L 2025-05-07T20:26:28.5688139Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:28.5688409Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:28.5688661Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:28.5688935Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:28.5689409Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:28.5689929Z #define _IO_INTERNAL 010 2025-05-07T20:26:28.5690177Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:28.5690500Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:28.5690847Z #define __dev_t_defined 2025-05-07T20:26:28.5691080Z #define __DEPRECATED 1 2025-05-07T20:26:28.5691297Z #define __S32_TYPE int 2025-05-07T20:26:28.5691541Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:28.5691830Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:28.5692078Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:28.5692331Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:28.5692929Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:28.5693560Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:28.5693859Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:28.5694196Z #define OVERFLOW 3 2025-05-07T20:26:28.5694441Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:28.5694743Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:28.5695241Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5695580Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:28.5695903Z #define __SSE2_MATH__ 1 2025-05-07T20:26:28.5696144Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:28.5696450Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5696743Z #define _IO_STDIO_H 2025-05-07T20:26:28.5696988Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:28.5697275Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:28.5697592Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:28.5697977Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5698281Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:28.5698540Z #define __amd64 1 2025-05-07T20:26:28.5698752Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:28.5699014Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:28.5699288Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:28.5699563Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:28.5699866Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:28.5700128Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:28.5700414Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:28.5700670Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:28.5700917Z #define __bounded 2025-05-07T20:26:28.5701143Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5701428Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:28.5701708Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:28.5701970Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:28.5702236Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5702555Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:28.5702968Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:28.5703360Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:28.5703629Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:28.5703963Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:28.5704296Z #define STA_PLL 0x0001 2025-05-07T20:26:28.5704539Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:28.5704808Z #define __GNUG__ 11 2025-05-07T20:26:28.5705032Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:28.5705296Z #define _T_WCHAR 2025-05-07T20:26:28.5705529Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:28.5705827Z #define __specialization_static 2025-05-07T20:26:28.5706127Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:28.5706428Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:28.5706689Z #define cudaArraySparse 0x40 2025-05-07T20:26:28.5706950Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:28.5707190Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:28.5707474Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:28.5707772Z #define _WCHAR_T 2025-05-07T20:26:28.5707983Z #define __cudaCDP2Free 2025-05-07T20:26:28.5708617Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:28.5709301Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:28.5709705Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:28.5710132Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5710404Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:28.5710669Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:28.5710992Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:28.5711348Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:28.5711588Z #define __NO_CTYPE 1 2025-05-07T20:26:28.5711816Z #define __stub_bdflush 2025-05-07T20:26:28.5712169Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:28.5712580Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:28.5712875Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:28.5713135Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:28.5713407Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:28.5713797Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:28.5714088Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:28.5714421Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:28.5714759Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:28.5715031Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:28.5715307Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:28.5715641Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:28.5715977Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:28.5716331Z #define _IO_STDIO 040000 2025-05-07T20:26:28.5716648Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:28.5717026Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:28.5717330Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:28.5717613Z #define _PTRDIFF_T 2025-05-07T20:26:28.5717820Z #define _MOVE_H 1 2025-05-07T20:26:28.5718034Z #define __cpp_hex_float 201603L 2025-05-07T20:26:28.5718290Z #define ADJ_TAI 0x0080 2025-05-07T20:26:28.5718514Z #define __ptrvalue 2025-05-07T20:26:28.5718728Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:28.5718976Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:28.5719256Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:28.5719545Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:28.5719792Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:28.5720069Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:28.5720463Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:28.5720843Z #define __USE_GNU 1 2025-05-07T20:26:28.5721071Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:28.5721553Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:28.5721927Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:28.5722371Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:28.5722760Z #define WEXITED 4 2025-05-07T20:26:28.5722972Z #define _IO_NO_READS 4 2025-05-07T20:26:28.5723278Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:28.5723745Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:28.5724031Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:28.5724338Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:28.5724654Z #define __uid_t_defined 2025-05-07T20:26:28.5724897Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:28.5725185Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:28.5725463Z #define WNOHANG 1 2025-05-07T20:26:28.5725709Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:28.5726018Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:28.5726294Z #define cudaEventDefault 0x00 2025-05-07T20:26:28.5726597Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:28.5726913Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:28.5727159Z #define __x86_64 1 2025-05-07T20:26:28.5727394Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:28.5727784Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5728264Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:28.5728759Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:28.5729190Z #define __PTRDIFF_T 2025-05-07T20:26:28.5729507Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:28.5729883Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:28.5730159Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5730446Z #define _Mlong_double_ long double 2025-05-07T20:26:28.5730730Z #define __cpp_lambdas 200907L 2025-05-07T20:26:28.5730989Z #define _IO_DEC 020 2025-05-07T20:26:28.5731209Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:28.5731477Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:28.5731765Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:28.5732039Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:28.5732298Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:28.5732706Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:28.5733026Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:28.5733298Z #define _ANSI_STDDEF_H 2025-05-07T20:26:28.5733571Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:28.5733888Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:28.5734247Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:28.5734639Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:28.5734923Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:28.5735309Z #define __cpp_template_auto 201606L 2025-05-07T20:26:28.5735670Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:28.5736046Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:28.5736308Z #define 
__key_t_defined 2025-05-07T20:26:28.5736562Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:28.5736933Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:28.5737405Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:28.5737778Z #define __GNUC_VA_LIST 2025-05-07T20:26:28.5738116Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:28.5738836Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:28.5739118Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:28.5739395Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:28.5739687Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:28.5739929Z #define __WCOREFLAG 0x80 2025-05-07T20:26:28.5740189Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:28.5740505Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:28.5740775Z #define __LP64__ 1 2025-05-07T20:26:28.5741018Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:28.5741332Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:28.5741604Z #define _IO_off64_t __off64_t 2025-05-07T20:26:28.5741860Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5742117Z #define __time_t_defined 1 2025-05-07T20:26:28.5742372Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:28.5742704Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:28.5743067Z #define __USE_UNIX98 1 2025-05-07T20:26:28.5743304Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5743565Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:28.5743828Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:28.5744124Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:28.5744424Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:28.5744678Z #define SEEK_CUR 1 2025-05-07T20:26:28.5744912Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5745168Z #define _ASSERT_H 1 2025-05-07T20:26:28.5745731Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:28.5746361Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:28.5746632Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:28.5746876Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:28.5747149Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:28.5747421Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:28.5747787Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:28.5748191Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:28.5748841Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:28.5749484Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:28.5749780Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:28.5750129Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:28.5750503Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:28.5750769Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5751051Z #define cudaArrayDefault 0x00 2025-05-07T20:26:28.5751328Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:28.5751866Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:28.5752151Z #define TLOSS 5 2025-05-07T20:26:28.5752366Z #define __ssize_t_defined 2025-05-07T20:26:28.5752610Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:28.5752879Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:28.5753167Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:28.5753450Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:28.5753812Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:28.5754193Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:28.5754677Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:28.5754982Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:28.5755274Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:28.5755560Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:28.5755813Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:28.5756149Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:28.5756545Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:28.5756782Z #define __cdecl 2025-05-07T20:26:28.5757023Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:28.5757351Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:28.5757671Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:28.5757919Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:28.5758184Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:28.5758465Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:28.5758728Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:28.5759035Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:28.5759363Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:28.5759756Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:28.5760183Z #define ADJ_NANO 0x2000 2025-05-07T20:26:28.5760487Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:28.5760834Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:28.5761125Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:28.5761384Z #define __FLT_DIG__ 6 2025-05-07T20:26:28.5761726Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:28.5762124Z #define __NO_INLINE__ 1 2025-05-07T20:26:28.5762427Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:28.5762776Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:28.5763027Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:28.5763289Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:28.5763581Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:28.5764007Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:28.5764301Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:28.5764590Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:28.5764959Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:28.5765372Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 
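__CUDACC_VER_BUILD__ 85 here, together with __CUDACC_VER_MAJOR__ 12 and __CUDACC_VER_MINOR__ 6 elsewhere in this dump, pins the compiler to nvcc 12.6, matching the 12.6.3 entry in the job name. A sketch of the usual compile-time gate on these macros; the fallback definitions merely echo the values from this log so the snippet also builds with a plain host compiler, and the 12.4 threshold is an invented example:

#include <cstdio>

// Fallbacks carrying the values observed in this log, for non-nvcc builds.
#ifndef __CUDACC_VER_MAJOR__
#define __CUDACC_VER_MAJOR__ 12
#define __CUDACC_VER_MINOR__ 6
#define __CUDACC_VER_BUILD__ 85
#endif

// Collapse major/minor into one comparable number, using the same
// major*1000 + minor*10 encoding that CUDART_VERSION uses (12.6 -> 12060).
#define NVCC_VERSION_ENCODED (__CUDACC_VER_MAJOR__ * 1000 + __CUDACC_VER_MINOR__ * 10)
static_assert(NVCC_VERSION_ENCODED >= 12040, "this example assumes nvcc >= 12.4");

int main() {
    std::printf("nvcc %d.%d.%d (encoded %d)\n", __CUDACC_VER_MAJOR__,
                __CUDACC_VER_MINOR__, __CUDACC_VER_BUILD__, NVCC_VERSION_ENCODED);
}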
2025-05-07T20:26:28.5765717Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:28.5766057Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:28.5766301Z #define MAX_CANON 255 2025-05-07T20:26:28.5766556Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:28.5766808Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:28.5767073Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:28.5774655Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:28.5774974Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:28.5775268Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:28.5775553Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:28.5775892Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:28.5776210Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:28.5776468Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:28.5776754Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:28.5777050Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:28.5777331Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:28.5777635Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:28.5778094Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:28.5778359Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:28.5778611Z #define _SYS_TYPES_H 1 2025-05-07T20:26:28.5778855Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:28.5779121Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:28.5779366Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:28.5779602Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:28.5779880Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:28.5780174Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:28.5780419Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:28.5780817Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:28.5781090Z #define FP_SUBNORMAL 3 2025-05-07T20:26:28.5781335Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:28.5781616Z #define _INITIALIZER_LIST 2025-05-07T20:26:28.5781874Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:28.5782119Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:28.5782395Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:28.5782695Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:28.5782947Z #define _IO_file_flags _flags 2025-05-07T20:26:28.5783200Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:28.5783437Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:28.5783712Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:28.5783988Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:28.5784243Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:28.5784617Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:28.5785011Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:28.5785318Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:28.5785590Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:28.5785843Z #define _BSD_SOURCE 1 2025-05-07T20:26:28.5786078Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:28.5786925Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:28.5787770Z #define __catch(X) catch(X) 2025-05-07T20:26:28.5788028Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:28.5788309Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:28.5788581Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:28.5788832Z #define __STRING(x) #x 2025-05-07T20:26:28.5789068Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:28.5789337Z #define _T_PTRDIFF_
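__STRING(x) #x and __UINT64_C(c) c ## UL in the stretch above, along with __CONCAT(x,y) x ## y earlier in the dump, are the three classic preprocessor operators at work: stringize, suffix-paste, and token-paste. A small self-contained demonstration using renamed copies so it does not collide with the system headers:

#include <cstdint>
#include <cstdio>

#define MY_STRING(x) #x         // stringize: count1 -> "count1"
#define MY_CONCAT(x,y) x ## y   // token-paste: count, 1 -> count1
#define MY_UINT64_C(c) c ## UL  // paste a UL suffix onto a literal

int main() {
    int MY_CONCAT(count, 1) = 7;  // declares a variable named count1
    std::uint64_t big = MY_UINT64_C(0xffffffffffffffff);
    std::printf("%s = %d, big = %llx\n", MY_STRING(count1), count1,
                static_cast<unsigned long long>(big));
}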
2025-05-07T20:26:28.5789584Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:28.5789888Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:28.5790157Z #define __unbounded 2025-05-07T20:26:28.5790398Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5790688Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:28.5790960Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5791260Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:28.5791536Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:28.5791824Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:28.5792155Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:28.5792462Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:28.5792731Z #define __managed__ __location__(managed) 2025-05-07T20:26:28.5793027Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:28.5793420Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:28.5793838Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:28.5794088Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:28.5794456Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:28.5794865Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:28.5795106Z #define _SYS_SIZE_T_H 2025-05-07T20:26:28.5795393Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:28.5795727Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:28.5795998Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:28.5796328Z #define _CRTIMP 2025-05-07T20:26:28.5796667Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:28.5796930Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:28.5797230Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:28.5797553Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:28.5797896Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:28.5798302Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5798615Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:28.5798893Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:28.5799253Z #define __SIZE_T__ 2025-05-07T20:26:28.5799466Z #define __stub_gtty 2025-05-07T20:26:28.5799689Z #define __pid_t_defined 2025-05-07T20:26:28.5799940Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:28.5800231Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5800536Z #define __glibcxx_function_requires(...) 
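__GNUC_PREREQ just above is glibc's whole compiler-version scheme in one line: pack (major, minor) into a single integer with the major version in the high 16 bits, then compare. A sketch with a renamed copy; the 4.8 threshold is an arbitrary example, and this build would report true, since the dump shows __GNUC__ 11 and __VERSION__ "11.4.0":

#include <cstdio>

// Renamed copy of __GNUC_PREREQ from the dump.
#define MY_GNUC_PREREQ(maj, min) \
    ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))

int main() {
#if defined(__GNUC__)
    std::printf("GCC-compatible >= 4.8? %d\n", MY_GNUC_PREREQ(4, 8));
#else
    std::puts("__GNUC__ not defined");
#endif
}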
2025-05-07T20:26:28.5800817Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:28.5801059Z #define __need_clockid_t 2025-05-07T20:26:28.5801306Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:28.5801554Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:28.5801868Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:28.5802183Z #define _IO_HEX 0100 2025-05-07T20:26:28.5802431Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:28.5802761Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:28.5803069Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:28.5803343Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:28.5803906Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5804352Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:28.5804664Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:28.5804949Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:28.5805061Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:28.5805163Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:28.5805243Z #define __stub_sstk 2025-05-07T20:26:28.5805346Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:28.5805499Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:28.5805579Z #define __wur 2025-05-07T20:26:28.5805701Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:28.5805786Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:28.5805877Z #define _IO_OCT 040 2025-05-07T20:26:28.5805971Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:28.5806058Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:28.5806155Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:28.5806284Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:28.5806378Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:28.5806485Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:28.5806669Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:28.5806762Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:28.5806856Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:28.5806962Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:28.5807054Z #define __off64_t_defined 2025-05-07T20:26:28.5807155Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:28.5807241Z #define __FLT128_DIG__ 33 2025-05-07T20:26:28.5807351Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:28.5807447Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:28.5807528Z #define __INT32_C(c) c 2025-05-07T20:26:28.5807628Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:28.5807722Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:28.5807814Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:28.5807909Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:28.5807999Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:28.5808098Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:28.5808233Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:28.5808326Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:28.5808416Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:28.5808513Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:28.5808604Z #define __have_pthread_attr_t 1 2025-05-07T20:26:28.5808806Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:28.5809027Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:28.5809132Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:28.5809242Z #define __cudaCDP2EventRecord 2025-05-07T20:26:28.5809335Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:28.5809418Z #define 
htole32(x) (x) 2025-05-07T20:26:28.5809671Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:28.5809789Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:28.5809966Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:28.5810125Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:28.5810262Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:28.5810392Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:28.5810527Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:28.5810617Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:28.5810728Z #define cudaArrayLayered 0x01 2025-05-07T20:26:28.5810893Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:28.5810999Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:28.5811100Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:28.5811198Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:28.5811276Z #define unix 1 2025-05-07T20:26:28.5811376Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:28.5811465Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:28.5811563Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:28.5811683Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:28.5811764Z #define __USE_POSIX 1 2025-05-07T20:26:28.5811861Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:28.5811990Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:28.5812076Z #define __THROWNL throw () 2025-05-07T20:26:28.5812173Z #define __cpp_rtti 199711L 2025-05-07T20:26:28.5812274Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:28.5812360Z #define __PMT(args) args 2025-05-07T20:26:28.5812481Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5812625Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:28.5812736Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:28.5812833Z #define _SIZE_T_DECLARED 2025-05-07T20:26:28.5812925Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:28.5813019Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:28.5813406Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:28.5813507Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:28.5813606Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:28.5813699Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:28.5813836Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:28.5813926Z #define _WCHAR_T_H 2025-05-07T20:26:28.5814014Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:28.5814100Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:28.5814195Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:28.5814291Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:28.5814393Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:28.5814478Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:28.5814584Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:28.5814670Z #define __ELF__ 1 2025-05-07T20:26:28.5814769Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:28.5814867Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:28.5814957Z #define STA_INS 0x0010 2025-05-07T20:26:28.5815054Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:28.5815228Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:28.5815327Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:28.5815419Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:28.5815527Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
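__WIFEXITED(status) (__WTERMSIG(status) == 0) above, together with __WTERMSIG(status) ((status) & 0x7f) and __WCOREFLAG 0x80 earlier in the dump, spells out the classic wait-status layout: terminating signal in the low 7 bits (zero means a normal exit), core-dump flag in bit 7, exit code in bits 8-15. A sketch with renamed copies; the body of __WEXITSTATUS does not appear in the dump, so the bits 8-15 extraction below is an assumption based on glibc:

#include <cstdio>

// Renamed copies of the <sys/wait.h> bit layout shown in the dump.
#define MY_WTERMSIG(status)    ((status) & 0x7f)
#define MY_WIFEXITED(status)   (MY_WTERMSIG(status) == 0)
#define MY_WEXITSTATUS(status) (((status) & 0xff00) >> 8)  // assumed, per glibc

int main() {
    int normal = 3 << 8;  // process called exit(3): code in bits 8-15
    int killed = 9;       // process terminated by signal 9 (SIGKILL)
    std::printf("normal: exited=%d code=%d\n",
                MY_WIFEXITED(normal), MY_WEXITSTATUS(normal));
    std::printf("killed: exited=%d sig=%d\n",
                MY_WIFEXITED(killed), MY_WTERMSIG(killed));
}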
2025-05-07T20:26:28.5815641Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5815739Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:28.5815940Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:28.5816039Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:28.5816215Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:28.5816401Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:28.5816499Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:28.5816820Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:28.5816952Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:28.5817167Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:28.5817250Z #define __FLT_RADIX__ 2 2025-05-07T20:26:28.5817357Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:28.5817520Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:28.5817620Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:28.5817715Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:28.5817815Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:28.5817925Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:28.5818020Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:28.5818118Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:28.5818208Z #define WORD_BIT 32 2025-05-07T20:26:28.5818294Z #define _IO_USER_BUF 1 2025-05-07T20:26:28.5818384Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:28.5818490Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5818598Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:28.5818699Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:28.5818793Z #define __long_double_t long double 2025-05-07T20:26:28.5818889Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:28.5818986Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:28.5819386Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:28.5819468Z #define __k8 1 2025-05-07T20:26:28.5819664Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:28.5819835Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:28.5819949Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:28.5820052Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:28.5820146Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:28.5820249Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:28.5820341Z #define __blksize_t_defined 2025-05-07T20:26:28.5820432Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:28.5820534Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:28.5820646Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:28.5820742Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:28.5820851Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:28.5820943Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:28.5821034Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:28.5821293Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:28.5821632Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:28.5821738Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:28.5821833Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:28.5821914Z #define SEEK_SET 0 2025-05-07T20:26:28.5822018Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:28.5822111Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:28.5822300Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:28.5822409Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:28.5822510Z #define __cudaCDP2GetLastError 2025-05-07T20:26:28.5822607Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:28.5822700Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:28.5823014Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:28.5823117Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:28.5823211Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:28.5823299Z #define __stub_sigreturn 2025-05-07T20:26:28.5823631Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:28.5823727Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:28.5823816Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:28.5823917Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:28.5823999Z #define CLOCK_TAI 11 2025-05-07T20:26:28.5824102Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:28.5824193Z #define __restrict_arr 2025-05-07T20:26:28.5824302Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:28.5824518Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:28.5825039Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:28.5825221Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:28.5825311Z #define __USE_MISC 1 2025-05-07T20:26:28.5825420Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:28.5825517Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:28.5825609Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:28.5825711Z #define __LDBL_DIG__ 18 2025-05-07T20:26:28.5825808Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:28.5825907Z #define __malloc_and_calloc_defined 2025-05-07T20:26:28.5826005Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:28.5826108Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:28.5826209Z #define __x86_64__ 1 2025-05-07T20:26:28.5826302Z #define _SIZE_T_ 2025-05-07T20:26:28.5827188Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:28.5827293Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:28.5827391Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:28.5827504Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:28.5827626Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:28.5827718Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:28.5827826Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:28.5827952Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:28.5828088Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:28.5828183Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:28.5828650Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
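__bswap_constant_32 and __bswap_constant_64 above are the pure mask-and-shift byte reversals behind the htobe32/be32toh entries in this dump; on this little-endian host (le32toh(x) is the identity here) the big-endian conversions are exactly these swaps. A renamed copy of the 32-bit one, usable in constant expressions:

#include <cstdio>

// Renamed copy of __bswap_constant_32 from the dump.
#define MY_BSWAP32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | \
                       (((x) & 0x0000ff00) << 8)  | (((x) & 0x000000ff) << 24))

static_assert(MY_BSWAP32(0x11223344u) == 0x44332211u, "bytes reversed");

int main() {
    std::printf("htobe32(0x%08x) would be 0x%08x on this host\n",
                0xdeadbeefu, MY_BSWAP32(0xdeadbeefu));
}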
2025-05-07T20:26:28.5828773Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:28.5828923Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:28.5829020Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:28.5829117Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:28.5829209Z #define STA_FLL 0x0008 2025-05-07T20:26:28.5829350Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:28.5829443Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:28.5829567Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5829674Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:28.5829765Z #define __stub_revoke 2025-05-07T20:26:28.5829854Z #define __timer_t_defined 1 2025-05-07T20:26:28.5829985Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:28.5830089Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:28.5830194Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:28.5830297Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:28.5830401Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:28.5830501Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:28.5830606Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:28.5830709Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:28.5830939Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:28.5831033Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:28.5831124Z #define _IO_off_t __off_t 2025-05-07T20:26:28.5831208Z #define __FLT64_DIG__ 15 2025-05-07T20:26:28.5831429Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:28.5831523Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:28.5831647Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5831775Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:28.5831950Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:28.5832050Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:28.5832137Z #define NULL __null 2025-05-07T20:26:28.5832264Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:28.5832365Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:28.5832466Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:28.5832557Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5832662Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:28.5832744Z #define FP_ZERO 2 2025-05-07T20:26:28.5832837Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:28.5832993Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:28.5833098Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5833179Z #define __WCHAR_T__ 2025-05-07T20:26:28.5833276Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:28.5833467Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:28.5833620Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:28.5833720Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:28.5833838Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:28.5833951Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:28.5834083Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:28.5834208Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:28.5834305Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:28.5834400Z #define _SIGSET_H_types 1 2025-05-07T20:26:28.5834512Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:28.5834623Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:28.5834768Z #define __isdigit_l(c,l) 
__isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:28.5834865Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:28.5834988Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:28.5835113Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:28.5835216Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:28.5835353Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:28.5835523Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:28.5835624Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:28.5835726Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:28.5835820Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:28.5835911Z #define STA_MODE 0x4000 2025-05-07T20:26:28.5836022Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:28.5836144Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:28.5836274Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:28.5836384Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:28.5836478Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:28.5836591Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:28.5836684Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:28.5836798Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:28.5836886Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:28.5837005Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5837088Z #define __SEG_FS 1 2025-05-07T20:26:28.5837175Z #define _IO_size_t size_t 2025-05-07T20:26:28.5837269Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:28.5837370Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:28.5837454Z #define __stub_lchmod 2025-05-07T20:26:28.5837544Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:28.5837661Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5837839Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:28.5837924Z #define __SEG_GS 1 2025-05-07T20:26:28.5838108Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:28.5838196Z #define _IOS_APPEND 8 2025-05-07T20:26:28.5838296Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:28.5838800Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:28.5838957Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:28.5839093Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:28.5839455Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:28.5839539Z #define htole16(x) (x) 2025-05-07T20:26:28.5839655Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:28.5839749Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:28.5839839Z #define __INT16_TYPE__ short int 2025-05-07T20:26:28.5839947Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:28.5840057Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:28.5840169Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:28.5840301Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:28.5840388Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:28.5840484Z #define __WCLONE 0x80000000 2025-05-07T20:26:28.5840575Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:28.5840656Z #define SEEK_HOLE 4 2025-05-07T20:26:28.5840750Z #define TIMER_ABSTIME 1 2025-05-07T20:26:28.5840842Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:28.5840933Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:28.5841115Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:28.5841233Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5841330Z #define __DRIVER_FUNCTIONS_H__ 
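__glibcxx_signed_b(T,B) ((T)(-1) < 0) just above is the root of the __glibcxx_*_b family scattered through this dump (__glibcxx_digits_b, __glibcxx_max_b, __glibcxx_min_b): libstdc++ derives the integer bounds in <limits> from these four expressions. A sketch with renamed copies, checked against <climits>:

#include <climits>

// Renamed copies of the libstdc++ helpers from the dump.
#define SIGNED_B(T,B) ((T)(-1) < 0)
#define DIGITS_B(T,B) (B - SIGNED_B(T,B))
#define MAX_B(T,B) (SIGNED_B(T,B) \
    ? (((((T)1 << (DIGITS_B(T,B) - 1)) - 1) << 1) + 1) : ~(T)0)
#define MIN_B(T,B) (SIGNED_B(T,B) ? -MAX_B(T,B) - 1 : (T)0)

static_assert(MAX_B(int, 32) == INT_MAX, "signed max matches <climits>");
static_assert(MIN_B(int, 32) == INT_MIN, "signed min matches <climits>");
static_assert(MAX_B(unsigned, 32) == UINT_MAX, "unsigned max matches <climits>");

int main() { return 0; }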
2025-05-07T20:26:28.5841448Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:28.5841543Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5841670Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:28.5841761Z #define _LINUX_LIMITS_H 2025-05-07T20:26:28.5841842Z #define linux 1 2025-05-07T20:26:28.5841943Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:28.5842053Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:28.5842147Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:28.5842246Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:28.5842353Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:28.5842496Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:28.5842595Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:28.5842689Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5842787Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:28.5842881Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:28.5842961Z #define htole64(x) (x) 2025-05-07T20:26:28.5843064Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:28.5843186Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:28.5843279Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:28.5843886Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:28.5843975Z #define __USE_POSIX2 1 2025-05-07T20:26:28.5844071Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:28.5844163Z #define __WALL 0x40000000 2025-05-07T20:26:28.5844259Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:28.5844340Z #define _XLOCALE_H 1 2025-05-07T20:26:28.5844441Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:28.5844537Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:28.5844630Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5844742Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:28.5844833Z #define __EXCEPTIONS 1 2025-05-07T20:26:28.5844938Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:28.5845128Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:28.5845211Z #define __WORDSIZE 64 2025-05-07T20:26:28.5845312Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:28.5845398Z #define _STL_RELOPS_H 1 2025-05-07T20:26:28.5845489Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:28.5845762Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:28.5845861Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:28.5845955Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:28.5846059Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:28.5846358Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:28.5846590Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:28.5846713Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:28.5846888Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:28.5846996Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:28.5847106Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:28.5847205Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:28.5847315Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:28.5847493Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:28.5847590Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:28.5847692Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:28.5847794Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:28.5847969Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:28.5848081Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:28.5848164Z #define _STRING_H 1 2025-05-07T20:26:28.5848270Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:28.5848358Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:28.5848453Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:28.5848593Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:28.5848688Z #define __code_model_small__ 1 2025-05-07T20:26:28.5848776Z #define _PSTL_CONFIG_H 2025-05-07T20:26:28.5848881Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:28.5848993Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:28.5849090Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:28.5849197Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:28.5849536Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:28.5849638Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:28.5849721Z #define le64toh(x) (x) 2025-05-07T20:26:28.5849811Z #define FILENAME_MAX 4096 2025-05-07T20:26:28.5849966Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:28.5850078Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:28.5850158Z #define L_cuserid 9 2025-05-07T20:26:28.5850252Z #define __ino_t_defined 2025-05-07T20:26:28.5850338Z #define __k8__ 1 2025-05-07T20:26:28.5850432Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:28.5850544Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:28.5850632Z #define __int8_t_defined 2025-05-07T20:26:28.5850727Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:28.5850825Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5850936Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:28.5851041Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:28.5851128Z #define _IOS_TRUNC 16 2025-05-07T20:26:28.5851246Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:28.5851402Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:28.5851484Z #define __HAVE_COLUMN 2025-05-07T20:26:28.5851570Z #define __stub_fdetach 2025-05-07T20:26:28.5851980Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:28.5852066Z #define __pic__ 2 2025-05-07T20:26:28.5852195Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5852291Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:28.5852382Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:28.5852488Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:28.5852573Z #define __stub_chflags 2025-05-07T20:26:28.5852661Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:28.5852750Z #define __need_IOV_MAX 2025-05-07T20:26:28.5852856Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:28.5853047Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:28.5853155Z #define __cpp_decltype 200707L 2025-05-07T20:26:28.5853253Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:28.5853344Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:28.5853456Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:28.5853543Z #define TTY_NAME_MAX 32 2025-05-07T20:26:28.5853708Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:28.5853826Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5854071Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:28.5854188Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:28.5854282Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:28.5854372Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:28.5854460Z #define __import__ 2025-05-07T20:26:28.5854548Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:28.5854681Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:28.5854774Z #define __export__ 2025-05-07T20:26:28.5854889Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:28.5854989Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:28.5855159Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:28.5855254Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:28.5855345Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:28.5855440Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:28.5855528Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:28.5855650Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:28.5855775Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:28.5855877Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:28.5855971Z #define WNOWAIT 0x01000000 2025-05-07T20:26:28.5856050Z #define PLOSS 6 2025-05-07T20:26:28.5856149Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:28.5856461Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:28.5856555Z #define EXIT_SUCCESS 0 2025-05-07T20:26:28.5856654Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:28.5856744Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:28.5856841Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:28.5856936Z #define __thread__ __thread 2025-05-07T20:26:28.5857029Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:28.5857122Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:28.5857233Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:28.5857453Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:28.5857568Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:28.5857667Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:28.5857748Z #define __linux__ 1 2025-05-07T20:26:28.5857842Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:28.5857975Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:28.5858065Z #define __S16_TYPE short int 2025-05-07T20:26:28.5858427Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:28.5858533Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:28.5858719Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:28.5858821Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:28.5858916Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:28.5858997Z #define _T_SIZE_ 2025-05-07T20:26:28.5859106Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:28.5859224Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:28.5859318Z #define _PSTL_VERSION 12000 2025-05-07T20:26:28.5859443Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:28.5859536Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:28.5859637Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:28.5859765Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:28.5859849Z #define _IOS_INPUT 1 2025-05-07T20:26:28.5859948Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:28.5860139Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:28.5860230Z #define __INT64_TYPE__ long int 2025-05-07T20:26:28.5860336Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:28.5860435Z #define __shared__ __location__(shared) 2025-05-07T20:26:28.5860525Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:28.5860685Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:28.5860774Z #define __gid_t_defined 2025-05-07T20:26:28.5860896Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:28.5860992Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:28.5861264Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:28.5861365Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:28.5861456Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:28.5861540Z #define ___int_size_t_h 2025-05-07T20:26:28.5861650Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5861771Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:28.5861932Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:28.5862043Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:28.5862135Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:28.5862239Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:28.5862335Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:28.5862456Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5862574Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:28.5862691Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:28.5862779Z #define __clock_t_defined 1 2025-05-07T20:26:28.5862893Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:28.5863000Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:28.5863089Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:28.5863189Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:28.5863288Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:28.5863394Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:28.5863488Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:28.5863660Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:28.5863750Z #define __SSE__ 1 2025-05-07T20:26:28.5863844Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:28.5863938Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:28.5864026Z #define _CTYPE_H 1 2025-05-07T20:26:28.5864115Z #define __sigset_t_defined 2025-05-07T20:26:28.5864209Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:28.5864307Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:28.5864391Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:28.5864490Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:28.5864586Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:28.5864668Z #define __SM_70_RT_H__ 2025-05-07T20:26:28.5864758Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:28.5864866Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:28.5864957Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:28.5865123Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:28.5865220Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:28.5865326Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:28.5865425Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:28.5865514Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:28.5865596Z #define __amd64__ 1 2025-05-07T20:26:28.5865691Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:28.5865794Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:28.5866057Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5866163Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:28.5866242Z #define EOF (-1) 2025-05-07T20:26:28.5866341Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:28.5866431Z #define __USE_POSIX199309 1 2025-05-07T20:26:28.5866525Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:28.5866623Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:28.5866718Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:28.5866813Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:28.5867015Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:28.5867107Z #define ____mbstate_t_defined 1 2025-05-07T20:26:28.5867194Z #define STA_NANO 0x2000 2025-05-07T20:26:28.5867296Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:28.5867387Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:28.5867470Z #define _IO_LINKED 0x80 2025-05-07T20:26:28.5867570Z #define __cpp_lib_launder 201606 2025-05-07T20:26:28.5867661Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:28.5867763Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:28.5867860Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:28.5868033Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:28.5868182Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:28.5868287Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5868385Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:28.5868484Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:28.5868576Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:28.5868663Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:28.5868808Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:28.5868925Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:28.5869127Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:28.5869313Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:28.5869397Z #define __stub_stty 2025-05-07T20:26:28.5869571Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:28.5869656Z #define le16toh(x) (x) 2025-05-07T20:26:28.5869767Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:28.5869945Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:28.5870025Z #define _SIZET_ 2025-05-07T20:26:28.5870113Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:28.5870201Z #define _SVID_SOURCE 1 2025-05-07T20:26:28.5870279Z #define _LP64 1 2025-05-07T20:26:28.5870365Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:28.5870619Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:28.5870728Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:28.5870820Z #define __UINT8_C(c) c 2025-05-07T20:26:28.5870912Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:28.5871003Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:28.5871119Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:28.5883903Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:28.5884046Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:28.5884148Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:28.5884265Z #define CUDARTAPI 2025-05-07T20:26:28.5884350Z #define IOV_MAX 1024 2025-05-07T20:26:28.5884504Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:28.5884613Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:28.5884718Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:28.5884803Z #define __wchar_t__ 2025-05-07T20:26:28.5884916Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:28.5884999Z #define SEEK_END 2 2025-05-07T20:26:28.5885097Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:28.5885280Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:28.5885381Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:28.5885531Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:28.5885622Z #define ____FILE_defined 1 2025-05-07T20:26:28.5885740Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:28.5885846Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:28.5885937Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:28.5886045Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:28.5886345Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:28.5886474Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:28.5886558Z #define _IO_RIGHT 04 2025-05-07T20:26:28.5886661Z #define __END_NAMESPACE_STD 2025-05-07T20:26:28.5886848Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:28.5887092Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:28.5887214Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:28.5887310Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:28.5887416Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:28.5887498Z #define _STDDEF_H_ 2025-05-07T20:26:28.5887670Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:28.5887774Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5887892Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:28.5888182Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:28.5888300Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:28.5888443Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:28.5888571Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:28.5888675Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:28.5888787Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:28.5888899Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:28.5889013Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:28.5889108Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:28.5889211Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:28.5889307Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:28.5889487Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:28.5889582Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:28.5889760Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:28.5889870Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:28.5889963Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:28.5890104Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:28.5890204Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:28.5890296Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:28.5890396Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:28.5890495Z #define P_tmpdir "/tmp" 2025-05-07T20:26:28.5890618Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:28.5890710Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:28.5890819Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:28.5890981Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:28.5891156Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:28.5891255Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:28.5891378Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:28.5891494Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:28.5891602Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:28.5891829Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:28.5891933Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:28.5892044Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:28.5892137Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:28.5892232Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:28.5892324Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:28.5892421Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:28.5892523Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:28.5892602Z #define __FXSR__ 1 2025-05-07T20:26:28.5892688Z #define _SIZE_T 2025-05-07T20:26:28.5892789Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:28.5892900Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:28.5893072Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:28.5893218Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:28.5893315Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:28.5893418Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:28.5893599Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:28.5893798Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:28.5893893Z #define _GXX_NULLPTR_T 2025-05-07T20:26:28.5894016Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:28.5894233Z #define FOPEN_MAX 16 2025-05-07T20:26:28.5894327Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:28.5894444Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:28.5894548Z #define __suseconds_t_defined 2025-05-07T20:26:28.5894635Z #define __off_t_defined 2025-05-07T20:26:28.5894719Z #define stderr stderr 2025-05-07T20:26:28.5894821Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:28.5894930Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:28.5895027Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:28.5895205Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:28.5895614Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:28.5895714Z #define __mode_t_defined 2025-05-07T20:26:28.5895801Z #define _GCC_SIZE_T 2025-05-07T20:26:28.5895903Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5896016Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:28.5896129Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:28.5896223Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:28.5896321Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:28.5896424Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:28.5896531Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:28.5896641Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:28.5896732Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:28.5896813Z #define __size_t__ 2025-05-07T20:26:28.5896948Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:28.5897047Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:28.5897167Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:28.5897317Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:28.5897411Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:28.5897590Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:28.5897679Z #define _ENDIAN_H 1 2025-05-07T20:26:28.5897784Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:28.5897892Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:28.5897996Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:28.5898076Z #define __try try 2025-05-07T20:26:28.5898179Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:28.5898276Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:28.5898374Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:28.5898633Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:28.5898722Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:28.5898808Z #define __PIC__ 2 2025-05-07T20:26:28.5898925Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:28.5899046Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:28.5899182Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:28.5899280Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:28.5899376Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:28.5899568Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:28.5899671Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:28.5899769Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:28.5899867Z #define _IO_uid_t __uid_t 2025-05-07T20:26:28.5899962Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:28.5900101Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:28.5900193Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:28.5900336Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:28.5900443Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:28.5900563Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:28.5900652Z #define LONG_BIT 64 2025-05-07T20:26:28.5900764Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:28.5900863Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:28.5900987Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:28.5901085Z #define __fsfilcnt_t_defined 2025-05-07T20:26:28.5901175Z #define __blkcnt_t_defined 2025-05-07T20:26:28.5901539Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:28.5901631Z #define __USE_LARGEFILE 1 2025-05-07T20:26:28.5901727Z #define __cpp_constexpr 201603L 2025-05-07T20:26:28.5901830Z #define CUDART_VERSION 12060 2025-05-07T20:26:28.5901920Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:28.5902021Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:28.5902117Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:28.5902313Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:28.5902406Z #define __lldiv_t_defined 1 2025-05-07T20:26:28.5902567Z #define __SSE2__ 1 2025-05-07T20:26:28.5902650Z #define _IOLBF 1 2025-05-07T20:26:28.5902749Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:28.5902852Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:28.5902956Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:28.5903060Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:28.5903169Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:28.5903256Z #define __INT32_TYPE__ int 2025-05-07T20:26:28.5903360Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:28.5903464Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:28.5903563Z #define __cpp_exceptions 199711L 2025-05-07T20:26:28.5903663Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:28.5903772Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:28.5903861Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:28.5903979Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:28.5904138Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:28.5904245Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:28.5904341Z #define __SWORD_TYPE long int 2025-05-07T20:26:28.5904439Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:28.5904540Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:28.5904633Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:28.5904724Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:28.5905010Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:28.5905108Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:28.5905253Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:28.5905340Z #define _T_SIZE 2025-05-07T20:26:28.5905447Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:28.5905572Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:28.5905699Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:28.5905790Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:28.5905885Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:28.5906005Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:28.5906102Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:28.5906206Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5906298Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:28.5906474Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:28.5906573Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:28.5906672Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:28.5906769Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:28.5906891Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5906974Z #define __PIE__ 2 2025-05-07T20:26:28.5907079Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:28.5907182Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:28.5907373Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:28.5907599Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:28.5907692Z #define __nlink_t_defined 2025-05-07T20:26:28.5907823Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:28.5907945Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:28.5908034Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:28.5908295Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:28.5908418Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:28.5908519Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:28.5908755Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:28.5908852Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:28.5908941Z #define __FILE_defined 1 2025-05-07T20:26:28.5909126Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:28.5909222Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:28.5909318Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:28.5909434Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:28.5909547Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:28.5909735Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:28.5909845Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:28.5909933Z #define __INT16_C(c) c 2025-05-07T20:26:28.5910032Z #define __U32_TYPE unsigned int 2025-05-07T20:26:28.5910140Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:28.5910262Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:28.5910348Z #define __STDC__ 1 2025-05-07T20:26:28.5910451Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:28.5910551Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:28.5910655Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:28.5910807Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:28.5910895Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:28.5910999Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:28.5911097Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:28.5911208Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:28.5911322Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:28.5911425Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:28.5911534Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:28.5911617Z #define stdin stdin 2025-05-07T20:26:28.5911707Z #define __ino64_t_defined 
2025-05-07T20:26:28.5911797Z #define STA_CLK 0x8000 2025-05-07T20:26:28.5911889Z #define __clockid_t_defined 1 2025-05-07T20:26:28.5912033Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:28.5912208Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:28.5912309Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:28.5912411Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:28.5912518Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:28.5912625Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:28.5912821Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:28.5912917Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:28.5913452Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:28.5913546Z #define DOMAIN 1 2025-05-07T20:26:28.5913639Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:28.5913723Z #define __NVCC__ 1 2025-05-07T20:26:28.5913838Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:28.5913959Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:28.5914062Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:28.5914172Z #define __throw_exception_again throw 2025-05-07T20:26:28.5914264Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:28.5914359Z #define __EXCEPTION_H 1 2025-05-07T20:26:28.5914454Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:28.5914557Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:28.5914865Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:28.5914978Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:28.5915077Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:28.5915183Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:28.5915284Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:28.5915381Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:28.5915530Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:28.5915636Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:28.5915838Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:28.5915934Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:28.5916039Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:28.5916164Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:28.5916277Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:28.5916423Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:28.5916527Z #define __useconds_t_defined 2025-05-07T20:26:28.5916625Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:28.5916884Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:28.5917040Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:28.5917126Z #define __SSE_MATH__ 1 2025-05-07T20:26:28.5917213Z #define _IO_wint_t wint_t 2025-05-07T20:26:28.5917314Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:28.5917403Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:28.5917504Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:28.5917624Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:28.5917721Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:28.5917819Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:28.5917904Z #define __USE_ATFILE 1 2025-05-07T20:26:28.5917994Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:26:28.5918094Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:28.5918183Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:28.5918408Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:28.5918511Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:28.5918621Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:28.5918730Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:28.5918839Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:28.5918922Z #define _STDLIB_H 1 2025-05-07T20:26:28.5919066Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:28.5919161Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:28.5919255Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:28.5919397Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:28.5919504Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:28.5919599Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:28.5919789Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:28.5919944Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:28.5920058Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:28.5920172Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:28.5920263Z #define __ldiv_t_defined 1 2025-05-07T20:26:28.5920454Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:28.5920546Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:28.5920713Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:28.5920822Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:28.5920913Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:28.5921012Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:28.5921123Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:28.5921220Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:28.5921301Z #define CUDART_CB 2025-05-07T20:26:28.5921409Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:28.5921530Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:28.5921625Z #define MB_LEN_MAX 16 2025-05-07T20:26:28.5921847Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:28.5921947Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:28.5922079Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:28.5922196Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:28.5922290Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:28.5922446Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:28.5922551Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:28.5922635Z #define _GNU_SOURCE 1 2025-05-07T20:26:28.5922726Z #define __stub_putmsg 2025-05-07T20:26:28.5922808Z #define __CUDACC__ 1 2025-05-07T20:26:28.5922989Z #define __N(msgid) (msgid) 2025-05-07T20:26:28.5923078Z #define __P(args) args 2025-05-07T20:26:28.5923330Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:28.5923438Z #define __cpp_init_captures 201304L 2025-05-07T20:26:28.5923542Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:28.5923770Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:28.5923874Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:28.5923955Z #define __WCHAR_T 2025-05-07T20:26:28.5924127Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:28.5924227Z #define __fsblkcnt_t_defined 2025-05-07T20:26:28.5924343Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:28.5924442Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:28.5924456Z 2025-05-07T20:26:28.6209762Z 2025-05-07T20:26:28.6210607Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:28.6210626Z 2025-05-07T20:26:30.5355558Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:30.5355962Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:30.5356275Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:30.5356584Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:30.5356918Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:30.5357128Z 2025-05-07T20:26:30.6012413Z 2025-05-07T20:26:30.6022863Z /usr/bin/nvidia-smi 2025-05-07T20:26:30.6028328Z + nvidia-smi 2025-05-07T20:26:30.6028488Z 2025-05-07T20:26:30.6209461Z Wed May 7 20:26:30 2025 2025-05-07T20:26:30.6209998Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.6210604Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:30.6211098Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:30.6211599Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:30.6212135Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:30.6212570Z | | | MIG M. | 2025-05-07T20:26:30.6212911Z |=========================================+========================+======================| 2025-05-07T20:26:30.6380879Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:30.6381512Z | 0% 27C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:30.6382249Z | | | N/A | 2025-05-07T20:26:30.6382763Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:30.6385777Z 2025-05-07T20:26:30.6386593Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.6387257Z | Processes: | 2025-05-07T20:26:30.6387820Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:30.6388332Z | ID ID Usage | 2025-05-07T20:26:30.6388793Z |=========================================================================================| 2025-05-07T20:26:30.6392262Z | No running processes found | 2025-05-07T20:26:30.6393129Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:30.8780251Z 2025-05-07T20:26:30.8785627Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:26:30.8840143Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:30.8840757Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:30.8853641Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:26:30.8854089Z env: 2025-05-07T20:26:30.8854402Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:26:30.8854864Z BUILD_ENV: build_binary 2025-05-07T20:26:30.8855209Z BUILD_TARGET: genai 2025-05-07T20:26:30.8855592Z BUILD_VARIANT: cuda 2025-05-07T20:26:30.8856016Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:26:30.8856330Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:26:30.8856714Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:26:30.8857424Z ##[endgroup] 2025-05-07T20:26:31.2268715Z ################################################################################ 2025-05-07T20:26:31.2269173Z # Install PyTorch (PIP) 2025-05-07T20:26:31.2269640Z # 2025-05-07T20:26:31.2285563Z # [2025-05-07T20:26:31.228Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:26:31.2286155Z ################################################################################ 2025-05-07T20:26:31.2286421Z 2025-05-07T20:26:31.2314031Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:26:32.2290843Z Channels: 2025-05-07T20:26:32.2291186Z - conda-forge 2025-05-07T20:26:32.2291646Z Platform: linux-64 2025-05-07T20:26:35.6013320Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:26:36.3440313Z Solving environment: \ | / done 2025-05-07T20:26:36.5604767Z 2025-05-07T20:26:36.5605250Z ## Package Plan ## 2025-05-07T20:26:36.5605674Z 2025-05-07T20:26:36.5605937Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:36.5606345Z 2025-05-07T20:26:36.5606473Z added / updated specs: 2025-05-07T20:26:36.5606865Z - numpy 2025-05-07T20:26:36.5607066Z 2025-05-07T20:26:36.5607084Z 2025-05-07T20:26:36.5607309Z The following packages will be downloaded: 2025-05-07T20:26:36.5607555Z 2025-05-07T20:26:36.5607703Z package | build 2025-05-07T20:26:36.5608201Z ---------------------------|----------------- 2025-05-07T20:26:36.5608974Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:26:36.5609489Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:26:36.5610015Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:26:36.5610652Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:26:36.5611205Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:26:36.5611716Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:26:36.5612349Z numpy-2.2.5 | py311h5d046bc_0 8.6 MB conda-forge 2025-05-07T20:26:36.5612828Z ------------------------------------------------------------ 2025-05-07T20:26:36.5613502Z Total: 15.9 MB 2025-05-07T20:26:36.5613793Z 2025-05-07T20:26:36.5613953Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:36.5614242Z 2025-05-07T20:26:36.5614487Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:26:36.5615136Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:26:36.5615743Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:26:36.5616406Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:26:36.5617142Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:26:36.5617790Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:26:36.5618697Z numpy 
conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:26:36.5619226Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:26:37.6360511Z Preparing transaction: done
2025-05-07T20:26:37.8367706Z Verifying transaction: done
2025-05-07T20:26:37.9374697Z Executing transaction: done
2025-05-07T20:26:38.1290167Z ################################################################################
2025-05-07T20:26:38.1290552Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:38.1290850Z #
2025-05-07T20:26:38.1305622Z # [2025-05-07T20:26:38.130Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:38.1306096Z ################################################################################
2025-05-07T20:26:38.1306321Z
2025-05-07T20:26:38.1321116Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:38.2238924Z [CHECK] Network does not appear to be blocked.
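[Editor's note] The "[EXEC] [ATTEMPT 0/3]" prefix above comes from a bounded-retry wrapper defined in .github/scripts/setup_env.bash; commands such as the conda install and the wget network probe are retried up to three times. A minimal sketch of that pattern follows; the helper name exec_with_retries, the backoff, and the limits here are illustrative assumptions, not the repo's exact implementation.

    # Hedged sketch of a bounded-retry runner like the one producing the
    # "[EXEC] [ATTEMPT i/3]" lines above; names and limits are assumptions.
    exec_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Example usage, mirroring the network probe in this log:
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null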
2025-05-07T20:26:38.2239491Z ################################################################################ 2025-05-07T20:26:38.2239877Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:38.2240161Z # 2025-05-07T20:26:38.2260087Z # [2025-05-07T20:26:38.225Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:26:38.2260893Z ################################################################################ 2025-05-07T20:26:38.2261122Z 2025-05-07T20:26:38.2285819Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:38.2311442Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:26:38.2328158Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:38.2328937Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:26:38.2337667Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:38.2347339Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:26:38.2369526Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:56.6986810Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:56.6987290Z Collecting torch 2025-05-07T20:27:56.6987983Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:27:56.6988959Z Collecting filelock (from torch) 2025-05-07T20:27:56.6989595Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:27:56.6991020Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2) 2025-05-07T20:27:56.6991850Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:27:56.6992586Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:27:56.6993471Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 36.5 MB/s eta 0:00:00 2025-05-07T20:27:56.6993844Z Collecting networkx (from torch) 2025-05-07T20:27:56.6994336Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:27:56.6994991Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.8 MB/s eta 0:00:00 2025-05-07T20:27:56.6995340Z Collecting jinja2 (from torch) 2025-05-07T20:27:56.6995814Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:27:56.6996310Z Collecting fsspec (from torch) 2025-05-07T20:27:56.6996807Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:27:56.6997381Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.6998096Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:27:56.6998876Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 73.2 MB/s eta 0:00:00 2025-05-07T20:27:56.6999299Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.7000036Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:27:56.7000810Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 9.8 
MB/s eta 0:00:00 2025-05-07T20:27:56.7001206Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:27:56.7001909Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:27:56.7002680Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 40.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7003065Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:27:56.7003740Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:27:56.7004733Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 36.4 MB/s eta 0:00:00 2025-05-07T20:27:56.7005136Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:27:56.7006365Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:27:56.7007242Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 66.1 MB/s eta 0:00:00 2025-05-07T20:27:56.7007621Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:27:56.7008292Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:27:56.7009052Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 153.9 MB/s eta 0:00:00 2025-05-07T20:27:56.7009549Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:27:56.7010276Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:27:56.7011288Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 208.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7011693Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:27:56.7012402Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:27:56.7013177Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7013566Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:27:56.7014263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:27:56.7015043Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7015429Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:27:56.7016126Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:27:56.7016912Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 161.9 MB/s eta 0:00:00 2025-05-07T20:27:56.7017292Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:27:56.7018160Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:27:56.7019003Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:27:56.7019657Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:27:56.7020326Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:27:56.7021095Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:27:56.7021967Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:27:56.7022358Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:27:56.7023149Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:27:56.7023957Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:27:56.7024789Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:27:56.7026093Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:27:56.7027088Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:27:56.7027638Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:27:56.7028374Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 56.7 MB/s eta 0:00:00 2025-05-07T20:27:56.7028874Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:27:56.7029768Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:27:56.7030831Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl (825.6 MB) 2025-05-07T20:27:56.7031655Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.6/825.6 MB 36.6 MB/s eta 0:00:00 2025-05-07T20:27:56.7032421Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:27:56.7033257Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 21.0 MB/s eta 0:00:00 2025-05-07T20:27:56.7034003Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:27:56.7035085Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 103.0 MB/s eta 0:00:00 2025-05-07T20:27:56.7036054Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:27:56.7037105Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 134.1 MB/s eta 0:00:00 2025-05-07T20:27:56.7039431Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:27:56.7041057Z 2025-05-07T20:27:56.7043060Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 
nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:56.7045227Z 2025-05-07T20:27:58.9324440Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:58.9327344Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:02.3932401Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:05.8624040Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:05.8624518Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:09.2564083Z True 2025-05-07T20:28:09.2564337Z True 2025-05-07T20:28:09.2564443Z 2025-05-07T20:28:09.3213109Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:09.3256212Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:09.3256827Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:09.3270474Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:09.3270863Z env: 2025-05-07T20:28:09.3271099Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:09.3271425Z BUILD_ENV: build_binary 2025-05-07T20:28:09.3271668Z BUILD_TARGET: genai 2025-05-07T20:28:09.3271894Z BUILD_VARIANT: cuda 2025-05-07T20:28:09.3272130Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:09.3272378Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:09.3272679Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:09.3273202Z ##[endgroup] 2025-05-07T20:28:09.6639500Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:09.6641580Z ################################################################################ 2025-05-07T20:28:09.6642202Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:09.6642568Z # 2025-05-07T20:28:09.6657372Z # [2025-05-07T20:28:09.665Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:09.6657813Z ################################################################################ 2025-05-07T20:28:09.6658030Z 2025-05-07T20:28:09.6672770Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:09.7738643Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:09.7749386Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:09.7750060Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:09.7750459Z 2025-05-07T20:28:09.8618394Z 2025-05-07T20:28:09.8619121Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:09.8642289Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:15.8144150Z Collecting environment information... 
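[NOTE] Editorial sketch, not output from this run: the environment report that follows is produced by PyTorch's own collect_env.py. To gather the same report against this conda env by hand, the two commands already shown above suffice:

    # fetch the collector script from pytorch main and run it inside the build env
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py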
2025-05-07T20:28:15.8144756Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:15.8145238Z Is debug build: False 2025-05-07T20:28:15.8145499Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:15.8145778Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:15.8145964Z 2025-05-07T20:28:15.8146073Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:15.8146403Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:15.8146809Z Clang version: Could not collect 2025-05-07T20:28:15.8147189Z CMake version: Could not collect 2025-05-07T20:28:15.8147752Z Libc version: glibc-2.34 2025-05-07T20:28:15.8147979Z 2025-05-07T20:28:15.8148409Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:15.8149162Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:15.8149584Z Is CUDA available: True 2025-05-07T20:28:15.8149848Z CUDA runtime version: 12.6.85 2025-05-07T20:28:15.8150121Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:15.8150449Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:15.8150781Z Nvidia driver version: 570.133.07 2025-05-07T20:28:15.8151213Z cuDNN version: Could not collect 2025-05-07T20:28:15.8151484Z HIP runtime version: N/A 2025-05-07T20:28:15.8151746Z MIOpen runtime version: N/A 2025-05-07T20:28:15.8152016Z Is XNNPACK available: True 2025-05-07T20:28:15.8152175Z 2025-05-07T20:28:15.8152253Z CPU: 2025-05-07T20:28:15.8152471Z Architecture: x86_64 2025-05-07T20:28:15.8152812Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:15.8153321Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:15.8153714Z Byte Order: Little Endian 2025-05-07T20:28:15.8154038Z CPU(s): 16 2025-05-07T20:28:15.8154327Z On-line CPU(s) list: 0-15 2025-05-07T20:28:15.8155025Z Vendor ID: AuthenticAMD 2025-05-07T20:28:15.8155376Z Model name: AMD EPYC 7R32 2025-05-07T20:28:15.8155701Z CPU family: 23 2025-05-07T20:28:15.8155979Z Model: 49 2025-05-07T20:28:15.8156415Z Thread(s) per core: 2 2025-05-07T20:28:15.8156937Z Core(s) per socket: 8 2025-05-07T20:28:15.8157298Z Socket(s): 1 2025-05-07T20:28:15.8157577Z Stepping: 0 2025-05-07T20:28:15.8157873Z BogoMIPS: 5599.99 2025-05-07T20:28:15.8160120Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:15.8162379Z Hypervisor vendor: KVM 2025-05-07T20:28:15.8162683Z Virtualization type: full 2025-05-07T20:28:15.8163023Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:15.8163388Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:15.8164036Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:15.8164390Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:15.8164709Z NUMA node(s): 1 2025-05-07T20:28:15.8165017Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:15.8165346Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:15.8165835Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:15.8166196Z Vulnerability L1tf: Not affected 2025-05-07T20:28:15.8166555Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:15.8166900Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:15.8167408Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:15.8167778Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:15.8168321Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:15.8169312Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:15.8169930Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:15.8170622Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:15.8171476Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:15.8172157Z Vulnerability Srbds: Not affected 2025-05-07T20:28:15.8172509Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:15.8172746Z 2025-05-07T20:28:15.8172849Z Versions of relevant libraries: 2025-05-07T20:28:15.8173117Z [pip3] numpy==2.2.5 2025-05-07T20:28:15.8173364Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:15.8173672Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:15.8173984Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:15.8174422Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:15.8174823Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:15.8175114Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:15.8175412Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:15.8175710Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:15.8176141Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:15.8176602Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:15.8176897Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:15.8177181Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:15.8177478Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:15.8177760Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:15.8178082Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:15.8178456Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8178938Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8179575Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8180100Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8180783Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8181314Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:15.8181799Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8182393Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:15.8182876Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:15.8183369Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:15.8183848Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8184440Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:15.8184905Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8185354Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8185835Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8186314Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:15.8186763Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:15.8187370Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:15.8187836Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8188298Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:15.8188756Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8189220Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:15.8189811Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:15.8190304Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:15.8190787Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8191390Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:15.8191879Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:15.8192360Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:15.8192825Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:15.8193288Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:15.8193907Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:15.8194406Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8194913Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8195520Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:15.8196090Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:15.8196566Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:15.8197056Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:15.8197549Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:15.8198043Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:15.8198529Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:15.8199012Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:15.8199487Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:15.8200222Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:15.8200687Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:15.8200957Z 2025-05-07T20:28:15.8942407Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:15.8943075Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:15.8955737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:15.8956088Z env: 2025-05-07T20:28:15.8956315Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:15.8956607Z BUILD_ENV: build_binary 2025-05-07T20:28:15.8956857Z BUILD_TARGET: genai 2025-05-07T20:28:15.8957092Z BUILD_VARIANT: cuda 2025-05-07T20:28:15.8957338Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:15.8957585Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:15.8957894Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:15.8958228Z ##[endgroup] 2025-05-07T20:28:16.2397725Z ################################################################################ 2025-05-07T20:28:16.2398082Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:16.2398326Z # 2025-05-07T20:28:16.2415132Z # [2025-05-07T20:28:16.241Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:16.2415552Z ################################################################################ 2025-05-07T20:28:16.2415771Z 2025-05-07T20:28:16.2430789Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:16.3388290Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:16.3412330Z [BUILD] Running git submodules update ... 2025-05-07T20:28:16.3436057Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:16.3802553Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:16.3803474Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:16.3804511Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:16.3805304Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:16.3806100Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:16.3806942Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:16.3807733Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:16.3836293Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:16.4390703Z [BUILD] Installing other build dependencies ... 
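[NOTE] Editorial sketch: the [EXEC] [ATTEMPT n/3] lines throughout this log come from a retry wrapper in setup_env.bash. The helper's real name and details are not visible here; a minimal bash equivalent of the observed behavior looks like:

    # hypothetical retry wrapper; the actual helper in setup_env.bash may differ
    run_with_retries () {
      local max=3 attempt
      for (( attempt = 0; attempt <= max; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep 5
      done
      echo "[EXEC] command failed after ${max} attempts: $*" >&2
      return 1
    }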
2025-05-07T20:28:16.4413285Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:18.8859355Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:18.9046586Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:19.0034016Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:19.0069814Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:19.2169016Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:19.2214757Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:19.3254154Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:19.3291012Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:19.6313822Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:19.6352361Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:19.6877707Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:19.6881391Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:19.7582185Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:19.7625797Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:19.8049789Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:19.8626074Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:19.8721787Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:19.9871320Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:19.9904738Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:20.0924969Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:20.0973417Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:20.1490569Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:20.2105295Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:20.2150248Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:20.3081186Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:20.3113195Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:20.4127630Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:20.4177312Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:20.5266835Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.5298356Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:20.6265302Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:20.6297801Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:20.7307227Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.7340691Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.8373737Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.8405913Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.8993252Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.9491394Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.9533043Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.9928789Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:21.0385289Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:21.0418653Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:21.0863246Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:21.1476788Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:21.1512629Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:21.2000184Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:21.2494547Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:21.3079769Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.9113365Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 46.2 MB/s eta 0:00:00 2025-05-07T20:28:21.9147356Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.9782501Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:22.0387960Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:22.0986501Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:22.1615561Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:22.2239390Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:22.2837138Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 763.0/763.0 kB 8.7 MB/s eta 0:00:00 2025-05-07T20:28:22.2901625Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:22.3431334Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.3934085Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:22.4573193Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:22.5207913Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:22.5725256Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.6336334Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.6941627Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.7479926Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.8079460Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.9900829Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:25.3914093Z 2025-05-07T20:28:25.3941825Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:25.5768861Z ################################################################################ 2025-05-07T20:28:25.5769348Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.5769705Z # 2025-05-07T20:28:25.5788581Z # [2025-05-07T20:28:25.578Z] + install_triton_pip build_binary 2025-05-07T20:28:25.5789137Z ################################################################################ 2025-05-07T20:28:25.5789476Z 2025-05-07T20:28:25.5789823Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.5790438Z ################################################################################ 2025-05-07T20:28:25.5790975Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.5791413Z # 2025-05-07T20:28:25.5809483Z # [2025-05-07T20:28:25.580Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.5810000Z ################################################################################ 2025-05-07T20:28:25.5810214Z 2025-05-07T20:28:25.5827848Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.6746242Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:25.6746793Z ################################################################################ 2025-05-07T20:28:25.6747267Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.6747635Z # 2025-05-07T20:28:25.6767709Z # [2025-05-07T20:28:25.676Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6768191Z ################################################################################ 2025-05-07T20:28:25.6768821Z 2025-05-07T20:28:25.6815564Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.6831064Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.6831574Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.6840166Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6850219Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.6870403Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.3933499Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:33.3934875Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:33.3935611Z 2025-05-07T20:28:33.3935823Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.3936241Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:33.3937042Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:33.3938245Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:33.3939781Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 55.0 MB/s eta 0:00:00 2025-05-07T20:28:33.3940167Z Installing collected packages: pytorch-triton 2025-05-07T20:28:33.3940516Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:33.3940904Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:33.3941329Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:33.3941755Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:33.3942186Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:33.3942448Z 2025-05-07T20:28:35.6218593Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.6222551Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.7927469Z ################################################################################ 2025-05-07T20:28:37.7928065Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.7929025Z ################################################################################ 2025-05-07T20:28:37.7937563Z 2025-05-07T20:28:39.8577745Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.9874387Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:41.9878693Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.9923728Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.9924408Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.9936155Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.9936510Z env: 2025-05-07T20:28:41.9936737Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.9937033Z BUILD_ENV: build_binary 2025-05-07T20:28:41.9937279Z BUILD_TARGET: genai 2025-05-07T20:28:41.9937510Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.9937740Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.9937993Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.9938299Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.9938802Z ##[endgroup] 2025-05-07T20:28:42.3328473Z ################################################################################ 2025-05-07T20:28:42.3329241Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:42.3329503Z # 2025-05-07T20:28:42.3345928Z # [2025-05-07T20:28:42.334Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3346580Z ################################################################################ 2025-05-07T20:28:42.3346796Z 2025-05-07T20:28:42.3347159Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3347853Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3348191Z 2025-05-07T20:28:42.3465491Z d2bc5ec7f2c503b96ed71ce870e3919d4c82a2c7 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3468162Z 2025-05-07T20:28:42.3468592Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3468943Z 2025-05-07T20:28:42.3597480Z fb057b0fc70bac7d6bace794c1630e92472ffbffb4b9efd8fa610079134b2303 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3600662Z 2025-05-07T20:28:42.3601539Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3602174Z 2025-05-07T20:28:42.3829704Z d723859d888c0acd7c881d03de8ae205 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.3831807Z 2025-05-07T20:28:42.3844266Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:42.3865249Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.0613768Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.0614732Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:45.0615603Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:45.0616038Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:45.0616313Z 2025-05-07T20:28:52.0321405Z ################################################################################ 2025-05-07T20:28:52.0321797Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:52.0322171Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:52.0322603Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:52.0322915Z [CHECK] 2025-05-07T20:28:52.0323240Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:52.0323877Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:52.0324274Z ################################################################################ 2025-05-07T20:28:52.0324495Z 2025-05-07T20:28:52.0324635Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:56.0211401Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:28:59.9993656Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.9908067Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.9911655Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:15.9725240Z ################################################################################ 2025-05-07T20:29:15.9725801Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:15.9726260Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:15.9726736Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:15.9727218Z ################################################################################ 2025-05-07T20:29:15.9727528Z 2025-05-07T20:29:23.9549368Z ################################################################################ 2025-05-07T20:29:23.9550869Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:23.9552224Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:23.9553839Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:23.9554354Z ################################################################################ 2025-05-07T20:29:23.9554580Z 2025-05-07T20:29:23.9554737Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:27.9707806Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:31.9403885Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:36.0485215Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:40.0445570Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:40.0450078Z [INSTALL] Check for operator registrations ... 
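[NOTE] Editorial sketch: each registration check below amounts to importing fbgemm_gpu (which loads the compiled libraries) and looking the operator up on the torch.ops.fbgemm namespace; the lookup raises if the op was never registered. A one-liner equivalent, assuming the same env:

    conda run -n build_binary python -c 'import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)'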
2025-05-07T20:29:43.9552527Z fbgemm.nccl_init 2025-05-07T20:29:43.9552764Z 2025-05-07T20:29:44.0232577Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:47.9423808Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:47.9424076Z 2025-05-07T20:29:48.0070966Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:51.9232277Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:51.9232553Z 2025-05-07T20:29:51.9900067Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:51.9900807Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:51.9937152Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:51.9937610Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:51.9953742Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:51.9954096Z env: 2025-05-07T20:29:51.9954321Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:51.9954615Z BUILD_ENV: build_binary 2025-05-07T20:29:51.9954861Z BUILD_TARGET: genai 2025-05-07T20:29:51.9955093Z BUILD_VARIANT: cuda 2025-05-07T20:29:51.9955330Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:51.9955583Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:51.9955886Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:51.9956221Z ##[endgroup] 2025-05-07T20:29:52.3330801Z ################################################################################ 2025-05-07T20:29:52.3331323Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:52.3331660Z # 2025-05-07T20:29:52.3348497Z # [2025-05-07T20:29:52.334Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:52.3349066Z ################################################################################ 2025-05-07T20:29:52.3349359Z 2025-05-07T20:30:00.3346791Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:00.3347354Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:00.3347753Z [TEST] Determined the test directories: 2025-05-07T20:30:00.3348070Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:00.3348364Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:00.3348670Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:00.3348857Z 2025-05-07T20:30:00.3357403Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:00.3364524Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:00.3364962Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:00.3365250Z 2025-05-07T20:30:00.7650341Z 2025-05-07T20:30:00.7650788Z [TEST] Installing PyTest ... 
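[NOTE] Editorial sketch: pytest and expecttest are installed from conda-forge only (--override-channels), leaving the pip-installed torch stack in the env untouched; hypothesis already arrived earlier via requirements.txt. A quick sanity check after the install:

    conda run -n build_binary python -m pytest --version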
2025-05-07T20:30:00.7674960Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:01.8748471Z Channels: 2025-05-07T20:30:01.8748720Z - conda-forge 2025-05-07T20:30:01.8748955Z Platform: linux-64 2025-05-07T20:30:05.3015720Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:06.4607586Z Solving environment: \ | / done 2025-05-07T20:30:06.6889336Z 2025-05-07T20:30:06.6889755Z ## Package Plan ## 2025-05-07T20:30:06.6889939Z 2025-05-07T20:30:06.6890216Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:06.6890626Z 2025-05-07T20:30:06.6890726Z added / updated specs: 2025-05-07T20:30:06.6890989Z - expecttest 2025-05-07T20:30:06.6891202Z - pytest 2025-05-07T20:30:06.6891327Z 2025-05-07T20:30:06.6891332Z 2025-05-07T20:30:06.6891453Z The following packages will be downloaded: 2025-05-07T20:30:06.6891714Z 2025-05-07T20:30:06.6891836Z package | build 2025-05-07T20:30:06.6892151Z ---------------------------|----------------- 2025-05-07T20:30:06.6892527Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:06.6892990Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:06.6893452Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:06.6893885Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:06.6894317Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:06.6894740Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:06.6895141Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:06.6896052Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:06.6896463Z ------------------------------------------------------------ 2025-05-07T20:30:06.6896808Z Total: 428 KB 2025-05-07T20:30:06.6897018Z 2025-05-07T20:30:06.6897148Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:06.6897373Z 2025-05-07T20:30:06.6897576Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:06.6898093Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:06.6898624Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:06.6899091Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:06.6899558Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:06.6900001Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:06.6900432Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:06.6900852Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:06.6901110Z 2025-05-07T20:30:06.6901114Z 2025-05-07T20:30:06.6901118Z 2025-05-07T20:30:06.6901263Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:06.6901633Z [progress bars for pytest-8.3.5, packaging-25.0, colorama-0.4.6, pluggy-1.5.0, exceptiongroup-1.2.2, tomli-2.2.1, expecttest-0.3.0, and iniconfig-2.0.0 condensed; every package reached 100%] 2025-05-07T20:30:07.1851020Z done 2025-05-07T20:30:07.2844709Z Preparing transaction: done 2025-05-07T20:30:07.3849608Z Verifying transaction: done 2025-05-07T20:30:09.2875952Z Executing transaction: done 2025-05-07T20:30:09.4232043Z [TEST] Checking imports ... 2025-05-07T20:30:13.3988476Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:13.4001327Z [TEST] Setting feature flags ...
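[NOTE] Editorial sketch: feature flags are persisted as conda env vars, so they apply to every later conda run against this env. To confirm what ended up set:

    conda env config vars list -n build_binary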
2025-05-07T20:30:13.4001773Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:13.4002106Z 2025-05-07T20:30:13.8283032Z 2025-05-07T20:30:13.8283910Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:13.8285297Z ################################################################################ 2025-05-07T20:30:13.8285753Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:13.8286056Z # 2025-05-07T20:30:13.8305685Z # [2025-05-07T20:30:13.830Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:13.8306464Z ################################################################################ 2025-05-07T20:30:13.8306678Z 2025-05-07T20:30:13.8313480Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:13.8342217Z ./attention/gqa_test.py 2025-05-07T20:30:13.8342541Z ./coalesce/coalesce_test.py 2025-05-07T20:30:13.8342931Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:13.8343218Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:13.8343506Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:13.8343769Z ./moe/activation_test.py 2025-05-07T20:30:13.8344019Z ./moe/gather_scatter_test.py 2025-05-07T20:30:13.8344271Z ./moe/layers_test.py 2025-05-07T20:30:13.8344510Z ./moe/shuffling_test.py 2025-05-07T20:30:13.8344761Z ./quantize/quantize_test.py 2025-05-07T20:30:13.8344921Z 2025-05-07T20:30:13.8345036Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:13.8345253Z 2025-05-07T20:30:13.8362698Z ################################################################################ 2025-05-07T20:30:13.8378141Z # [2025-05-07T20:30:13.837Z] Run Python Test Suite: 2025-05-07T20:30:13.8378519Z # ./attention/gqa_test.py 2025-05-07T20:30:13.8378898Z ################################################################################ 2025-05-07T20:30:13.8402501Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:13.8403107Z 2025-05-07T20:30:16.3632409Z ============================= test session starts ============================== 2025-05-07T20:30:16.3633486Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:16.3634351Z cachedir: .pytest_cache 2025-05-07T20:30:16.3635339Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:16.3636940Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:16.3637644Z plugins: hypothesis-6.131.14 2025-05-07T20:30:18.0868819Z collecting ... 
collected 2 items 2025-05-07T20:30:18.0869047Z 2025-05-07T20:30:52.9141952Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:30:52.9142573Z self=, 2025-05-07T20:30:52.9142973Z int4_kv=False, 2025-05-07T20:30:52.9143242Z num_groups=1, 2025-05-07T20:30:52.9143505Z B=1, 2025-05-07T20:30:52.9143726Z MAX_T=4, 2025-05-07T20:30:52.9143964Z N_H_L=1, 2025-05-07T20:30:52.9144211Z ) 2025-05-07T20:30:52.9144446Z Trying example: test_gqa( 2025-05-07T20:30:52.9144810Z self=, 2025-05-07T20:30:52.9145202Z int4_kv=True, 2025-05-07T20:30:52.9145453Z num_groups=1, 2025-05-07T20:30:52.9145705Z B=1, 2025-05-07T20:30:52.9145941Z MAX_T=4, 2025-05-07T20:30:52.9146187Z N_H_L=1, 2025-05-07T20:30:52.9146424Z ) 2025-05-07T20:30:52.9146703Z Trying example: test_gqa( 2025-05-07T20:30:52.9147069Z self=, 2025-05-07T20:30:52.9147460Z int4_kv=True, 2025-05-07T20:30:52.9147714Z num_groups=4, 2025-05-07T20:30:52.9147952Z B=23, 2025-05-07T20:30:52.9148182Z MAX_T=33, 2025-05-07T20:30:52.9148420Z N_H_L=68, 2025-05-07T20:30:52.9148653Z ) 2025-05-07T20:30:52.9148895Z Trying example: test_gqa( 2025-05-07T20:30:52.9149256Z self=, 2025-05-07T20:30:52.9149631Z int4_kv=True, 2025-05-07T20:30:52.9149887Z num_groups=4, 2025-05-07T20:30:52.9150133Z B=77, 2025-05-07T20:30:52.9150350Z MAX_T=4, 2025-05-07T20:30:52.9150583Z N_H_L=1, 2025-05-07T20:30:52.9150811Z ) 2025-05-07T20:30:52.9151036Z Trying example: test_gqa( 2025-05-07T20:30:52.9151419Z self=, 2025-05-07T20:30:52.9151818Z int4_kv=True, 2025-05-07T20:30:52.9152072Z num_groups=4, 2025-05-07T20:30:52.9152313Z B=77, 2025-05-07T20:30:52.9153078Z MAX_T=52, 2025-05-07T20:30:52.9153317Z N_H_L=67, 2025-05-07T20:30:52.9153546Z ) 2025-05-07T20:30:52.9153783Z Trying example: test_gqa( 2025-05-07T20:30:52.9154134Z self=, 2025-05-07T20:30:52.9154510Z int4_kv=False, 2025-05-07T20:30:52.9154792Z num_groups=4, 2025-05-07T20:30:52.9155042Z B=57, 2025-05-07T20:30:52.9155263Z MAX_T=45, 2025-05-07T20:30:52.9155503Z N_H_L=120, 2025-05-07T20:30:52.9155739Z ) 2025-05-07T20:30:52.9155962Z Trying example: test_gqa( 2025-05-07T20:30:52.9156313Z self=, 2025-05-07T20:30:52.9156696Z int4_kv=True, 2025-05-07T20:30:52.9156944Z num_groups=4, 2025-05-07T20:30:52.9157193Z B=52, 2025-05-07T20:30:52.9157419Z MAX_T=42, 2025-05-07T20:30:52.9157644Z N_H_L=53, 2025-05-07T20:30:52.9157874Z ) 2025-05-07T20:30:52.9158105Z Trying example: test_gqa( 2025-05-07T20:30:52.9158445Z self=, 2025-05-07T20:30:52.9158841Z int4_kv=True, 2025-05-07T20:30:52.9159094Z num_groups=1, 2025-05-07T20:30:52.9159334Z B=77, 2025-05-07T20:30:52.9159560Z MAX_T=95, 2025-05-07T20:30:52.9159796Z N_H_L=53, 2025-05-07T20:30:52.9160027Z ) 2025-05-07T20:30:52.9160253Z Trying example: test_gqa( 2025-05-07T20:30:52.9160603Z self=, 2025-05-07T20:30:52.9160980Z int4_kv=True, 2025-05-07T20:30:52.9161223Z num_groups=4, 2025-05-07T20:30:52.9161476Z B=113, 2025-05-07T20:30:52.9161704Z MAX_T=48, 2025-05-07T20:30:52.9161958Z N_H_L=96, 2025-05-07T20:30:52.9162215Z ) 2025-05-07T20:30:52.9162452Z Trying example: test_gqa( 2025-05-07T20:30:52.9162796Z self=, 2025-05-07T20:30:52.9163179Z int4_kv=False, 2025-05-07T20:30:52.9163645Z num_groups=1, 2025-05-07T20:30:52.9163890Z B=51, 2025-05-07T20:30:52.9164116Z MAX_T=61, 2025-05-07T20:30:52.9164352Z N_H_L=69, 2025-05-07T20:30:52.9164803Z ) 2025-05-07T20:30:52.9165050Z Trying example: test_gqa( 2025-05-07T20:30:52.9165401Z self=, 2025-05-07T20:30:52.9165776Z int4_kv=False, 2025-05-07T20:30:52.9166031Z num_groups=4, 2025-05-07T20:30:52.9166284Z B=17, 2025-05-07T20:30:52.9166510Z MAX_T=113, 
2025-05-07T20:30:52.9166751Z N_H_L=65, 2025-05-07T20:30:52.9166984Z ) 2025-05-07T20:30:52.9167209Z Trying example: test_gqa( 2025-05-07T20:30:52.9167561Z self=, 2025-05-07T20:30:52.9167949Z int4_kv=False, 2025-05-07T20:30:52.9168209Z num_groups=4, 2025-05-07T20:30:52.9168475Z B=17, 2025-05-07T20:30:52.9168722Z MAX_T=65, 2025-05-07T20:30:52.9168979Z N_H_L=65, 2025-05-07T20:30:52.9169211Z ) 2025-05-07T20:30:52.9169463Z Trying example: test_gqa( 2025-05-07T20:30:52.9169873Z self=, 2025-05-07T20:30:52.9170278Z int4_kv=False, 2025-05-07T20:30:52.9170538Z num_groups=4, 2025-05-07T20:30:52.9170819Z B=65, 2025-05-07T20:30:52.9171067Z MAX_T=65, 2025-05-07T20:30:52.9171296Z N_H_L=65, 2025-05-07T20:30:52.9171530Z ) 2025-05-07T20:30:52.9171801Z Trying example: test_gqa( 2025-05-07T20:30:52.9172143Z self=, 2025-05-07T20:30:52.9172524Z int4_kv=False, 2025-05-07T20:30:52.9172776Z num_groups=1, 2025-05-07T20:30:52.9173028Z B=6, 2025-05-07T20:30:52.9173250Z MAX_T=108, 2025-05-07T20:30:52.9173490Z N_H_L=14, 2025-05-07T20:30:52.9173718Z ) 2025-05-07T20:30:52.9173942Z Trying example: test_gqa( 2025-05-07T20:30:52.9174290Z self=, 2025-05-07T20:30:52.9174672Z int4_kv=False, 2025-05-07T20:30:52.9174921Z num_groups=1, 2025-05-07T20:30:52.9175168Z B=6, 2025-05-07T20:30:52.9175395Z MAX_T=14, 2025-05-07T20:30:52.9175621Z N_H_L=14, 2025-05-07T20:30:52.9175850Z ) 2025-05-07T20:30:52.9176082Z Trying example: test_gqa( 2025-05-07T20:30:52.9176479Z self=, 2025-05-07T20:30:52.9176959Z int4_kv=False, 2025-05-07T20:30:52.9177212Z num_groups=1, 2025-05-07T20:30:52.9177451Z B=6, 2025-05-07T20:30:52.9177676Z MAX_T=6, 2025-05-07T20:30:52.9177907Z N_H_L=14, 2025-05-07T20:30:52.9178129Z ) 2025-05-07T20:30:52.9178358Z Trying example: test_gqa( 2025-05-07T20:30:52.9178705Z self=, 2025-05-07T20:30:52.9179081Z int4_kv=False, 2025-05-07T20:30:52.9179332Z num_groups=1, 2025-05-07T20:30:52.9179576Z B=6, 2025-05-07T20:30:52.9179794Z MAX_T=6, 2025-05-07T20:30:52.9180028Z N_H_L=6, 2025-05-07T20:30:52.9180255Z ) 2025-05-07T20:30:52.9180481Z Trying example: test_gqa( 2025-05-07T20:30:52.9180833Z self=, 2025-05-07T20:30:52.9181220Z int4_kv=False, 2025-05-07T20:30:52.9181522Z num_groups=1, 2025-05-07T20:30:52.9181763Z B=70, 2025-05-07T20:30:52.9181987Z MAX_T=94, 2025-05-07T20:30:52.9182219Z N_H_L=78, 2025-05-07T20:30:52.9182455Z ) 2025-05-07T20:30:52.9182689Z Trying example: test_gqa( 2025-05-07T20:30:52.9183040Z self=, 2025-05-07T20:30:52.9183413Z int4_kv=False, 2025-05-07T20:30:52.9183666Z num_groups=1, 2025-05-07T20:30:52.9183918Z B=78, 2025-05-07T20:30:52.9184134Z MAX_T=94, 2025-05-07T20:30:52.9184369Z N_H_L=78, 2025-05-07T20:30:52.9184598Z ) 2025-05-07T20:30:52.9184823Z Trying example: test_gqa( 2025-05-07T20:30:52.9185172Z self=, 2025-05-07T20:30:52.9185555Z int4_kv=False, 2025-05-07T20:30:52.9185800Z num_groups=1, 2025-05-07T20:30:52.9186045Z B=94, 2025-05-07T20:30:52.9186269Z MAX_T=94, 2025-05-07T20:30:52.9186492Z N_H_L=78, 2025-05-07T20:30:52.9186720Z ) 2025-05-07T20:30:52.9186950Z Trying example: test_gqa( 2025-05-07T20:30:52.9187288Z self=, 2025-05-07T20:30:52.9187668Z int4_kv=False, 2025-05-07T20:30:52.9188030Z num_groups=1, 2025-05-07T20:30:52.9188279Z B=94, 2025-05-07T20:30:52.9188509Z MAX_T=94, 2025-05-07T20:30:52.9188746Z N_H_L=94, 2025-05-07T20:30:52.9188968Z ) 2025-05-07T20:30:52.9189199Z Trying example: test_gqa( 2025-05-07T20:30:52.9189547Z self=, 2025-05-07T20:30:52.9189924Z int4_kv=False, 2025-05-07T20:30:52.9190169Z num_groups=4, 2025-05-07T20:30:52.9190413Z B=41, 2025-05-07T20:30:52.9190638Z MAX_T=105, 
2025-05-07T20:30:52.9190891Z N_H_L=126, 2025-05-07T20:30:52.9191097Z ) 2025-05-07T20:30:52.9191284Z Trying example: test_gqa( 2025-05-07T20:30:52.9191567Z self=, 2025-05-07T20:30:52.9191879Z int4_kv=False, 2025-05-07T20:30:52.9192087Z num_groups=4, 2025-05-07T20:30:52.9192292Z B=105, 2025-05-07T20:30:52.9192481Z MAX_T=105, 2025-05-07T20:30:52.9192683Z N_H_L=126, 2025-05-07T20:30:52.9192873Z ) 2025-05-07T20:30:52.9193063Z Trying example: test_gqa( 2025-05-07T20:30:52.9193358Z self=, 2025-05-07T20:30:52.9193667Z int4_kv=False, 2025-05-07T20:30:52.9193882Z num_groups=4, 2025-05-07T20:30:52.9194088Z B=105, 2025-05-07T20:30:52.9194269Z MAX_T=105, 2025-05-07T20:30:52.9194468Z N_H_L=105, 2025-05-07T20:30:52.9194661Z ) 2025-05-07T20:30:52.9194845Z Trying example: test_gqa( 2025-05-07T20:30:52.9195134Z self=, 2025-05-07T20:30:52.9195443Z int4_kv=True, 2025-05-07T20:30:52.9195653Z num_groups=1, 2025-05-07T20:30:52.9195852Z B=95, 2025-05-07T20:30:52.9196040Z MAX_T=114, 2025-05-07T20:30:52.9196238Z N_H_L=43, 2025-05-07T20:30:52.9196422Z ) 2025-05-07T20:30:52.9196613Z Trying example: test_gqa( 2025-05-07T20:30:52.9196904Z self=, 2025-05-07T20:30:52.9197204Z int4_kv=True, 2025-05-07T20:30:52.9197410Z num_groups=1, 2025-05-07T20:30:52.9197615Z B=43, 2025-05-07T20:30:52.9197795Z MAX_T=114, 2025-05-07T20:30:52.9198095Z N_H_L=43, 2025-05-07T20:30:52.9198288Z ) 2025-05-07T20:30:52.9198473Z Trying example: test_gqa( 2025-05-07T20:30:52.9198769Z self=, 2025-05-07T20:30:52.9199086Z int4_kv=True, 2025-05-07T20:30:52.9199284Z num_groups=1, 2025-05-07T20:30:52.9199676Z B=43, 2025-05-07T20:30:52.9199865Z MAX_T=43, 2025-05-07T20:30:52.9200053Z N_H_L=43, 2025-05-07T20:30:52.9200242Z ) 2025-05-07T20:30:52.9200437Z Trying example: test_gqa( 2025-05-07T20:30:52.9200719Z self=, 2025-05-07T20:30:52.9201033Z int4_kv=False, 2025-05-07T20:30:52.9201240Z num_groups=1, 2025-05-07T20:30:52.9201444Z B=21, 2025-05-07T20:30:52.9201620Z MAX_T=38, 2025-05-07T20:30:52.9201816Z N_H_L=42, 2025-05-07T20:30:52.9202007Z ) 2025-05-07T20:30:52.9202189Z Trying example: test_gqa( 2025-05-07T20:30:52.9202477Z self=, 2025-05-07T20:30:52.9202792Z int4_kv=False, 2025-05-07T20:30:52.9203002Z num_groups=1, 2025-05-07T20:30:52.9203212Z B=38, 2025-05-07T20:30:52.9203521Z MAX_T=38, 2025-05-07T20:30:52.9203723Z N_H_L=42, 2025-05-07T20:30:52.9203916Z ) 2025-05-07T20:30:52.9204111Z Trying example: test_gqa( 2025-05-07T20:30:52.9204400Z self=, 2025-05-07T20:30:52.9204712Z int4_kv=False, 2025-05-07T20:30:52.9204929Z num_groups=1, 2025-05-07T20:30:52.9205129Z B=38, 2025-05-07T20:30:52.9205319Z MAX_T=42, 2025-05-07T20:30:52.9205508Z N_H_L=42, 2025-05-07T20:30:52.9205694Z ) 2025-05-07T20:30:52.9205887Z Trying example: test_gqa( 2025-05-07T20:30:52.9206176Z self=, 2025-05-07T20:30:52.9206483Z int4_kv=False, 2025-05-07T20:30:52.9206696Z num_groups=1, 2025-05-07T20:30:52.9206903Z B=42, 2025-05-07T20:30:52.9207083Z MAX_T=42, 2025-05-07T20:30:52.9207281Z N_H_L=42, 2025-05-07T20:30:52.9207472Z ) 2025-05-07T20:30:52.9207757Z Trying example: test_gqa( 2025-05-07T20:30:52.9208060Z self=, 2025-05-07T20:30:52.9208374Z int4_kv=True, 2025-05-07T20:30:52.9208574Z num_groups=1, 2025-05-07T20:30:52.9208778Z B=74, 2025-05-07T20:30:52.9208961Z MAX_T=20, 2025-05-07T20:30:52.9209156Z N_H_L=15, 2025-05-07T20:30:52.9209337Z ) 2025-05-07T20:30:52.9209526Z Trying example: test_gqa( 2025-05-07T20:30:52.9209818Z self=, 2025-05-07T20:30:52.9210122Z int4_kv=True, 2025-05-07T20:30:52.9210332Z num_groups=1, 2025-05-07T20:30:52.9210536Z B=20, 2025-05-07T20:30:52.9210715Z MAX_T=20, 
2025-05-07T20:30:52.9210913Z N_H_L=15, 2025-05-07T20:30:52.9211103Z ) 2025-05-07T20:30:52.9211289Z Trying example: test_gqa( 2025-05-07T20:30:52.9211579Z self=, 2025-05-07T20:30:52.9211890Z int4_kv=True, 2025-05-07T20:30:52.9212089Z num_groups=1, 2025-05-07T20:30:52.9212294Z B=20, 2025-05-07T20:30:52.9212485Z MAX_T=15, 2025-05-07T20:30:52.9212677Z N_H_L=15, 2025-05-07T20:30:52.9212871Z ) 2025-05-07T20:30:52.9213060Z Trying example: test_gqa( 2025-05-07T20:30:52.9213341Z self=, 2025-05-07T20:30:52.9213663Z int4_kv=True, 2025-05-07T20:30:52.9213872Z num_groups=1, 2025-05-07T20:30:52.9214068Z B=15, 2025-05-07T20:30:52.9214258Z MAX_T=20, 2025-05-07T20:30:52.9214452Z N_H_L=15, 2025-05-07T20:30:52.9214638Z ) 2025-05-07T20:30:52.9214834Z Trying example: test_gqa( 2025-05-07T20:30:52.9215127Z self=, 2025-05-07T20:30:52.9215429Z int4_kv=True, 2025-05-07T20:30:52.9215634Z num_groups=1, 2025-05-07T20:30:52.9215839Z B=15, 2025-05-07T20:30:52.9216020Z MAX_T=15, 2025-05-07T20:30:52.9216219Z N_H_L=15, 2025-05-07T20:30:52.9216413Z ) 2025-05-07T20:30:52.9216609Z Trying example: test_gqa( 2025-05-07T20:30:52.9216892Z self=, 2025-05-07T20:30:52.9217330Z int4_kv=False, 2025-05-07T20:30:52.9217546Z num_groups=4, 2025-05-07T20:30:52.9217744Z B=117, 2025-05-07T20:30:52.9217935Z MAX_T=104, 2025-05-07T20:30:52.9218134Z N_H_L=69, 2025-05-07T20:30:52.9218316Z ) 2025-05-07T20:30:52.9218509Z Trying example: test_gqa( 2025-05-07T20:30:52.9218796Z self=, 2025-05-07T20:30:52.9219099Z int4_kv=False, 2025-05-07T20:30:52.9219311Z num_groups=4, 2025-05-07T20:30:52.9219517Z B=117, 2025-05-07T20:30:52.9219698Z MAX_T=117, 2025-05-07T20:30:52.9219895Z N_H_L=69, 2025-05-07T20:30:52.9220085Z ) 2025-05-07T20:30:52.9220269Z Trying example: test_gqa( 2025-05-07T20:30:52.9220582Z self=, 2025-05-07T20:30:52.9220895Z int4_kv=False, 2025-05-07T20:30:52.9221099Z num_groups=4, 2025-05-07T20:30:52.9221304Z B=69, 2025-05-07T20:30:52.9221517Z MAX_T=117, 2025-05-07T20:30:52.9221737Z N_H_L=69, 2025-05-07T20:30:52.9221927Z ) 2025-05-07T20:30:52.9222129Z Trying example: test_gqa( 2025-05-07T20:30:52.9222417Z self=, 2025-05-07T20:30:52.9222798Z int4_kv=False, 2025-05-07T20:30:52.9223098Z num_groups=4, 2025-05-07T20:30:52.9223352Z B=117, 2025-05-07T20:30:52.9232768Z MAX_T=69, 2025-05-07T20:30:52.9233072Z N_H_L=69, 2025-05-07T20:30:52.9233280Z ) 2025-05-07T20:30:52.9233476Z PASSED 2025-05-07T20:30:52.9457225Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
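[Editor's note] The "Trying example: test_gqa(...)" lines above are Hypothesis's verbose example log. The "hypothesis profile 'ci'" line in each session header (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)) makes generation deterministic, so the same example sequence replays on every run. Below is a minimal sketch of how such a profile can be registered and combined with per-test @settings (the per-test verbosity=Verbosity.verbose is what produces the "Trying example:" output, as visible in the test source quoted later in this log); the conftest wiring shown here is an assumption, not FBGEMM's actual setup:

import hypothesis.strategies as st
from hypothesis import HealthCheck, Verbosity, given, settings

# Profile values mirror the "hypothesis profile 'ci'" line in the
# session headers of this log; where this registration lives in the
# FBGEMM tree is an assumption.
settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

# Per-test settings layer on top of the active profile.
@given(B=st.integers(min_value=1, max_value=128))
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_example(B: int) -> None:
    assert B >= 1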
2025-05-07T20:30:52.9457558Z 2025-05-07T20:30:52.9458202Z =========================== short test summary info ============================ 2025-05-07T20:30:52.9459207Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:52.9460145Z ======================== 1 passed, 1 skipped in 37.06s ========================= 2025-05-07T20:30:53.6294817Z 2025-05-07T20:30:53.6295373Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:53.6317036Z [TEST] Python test time for ./attention/gqa_test.py: 40 seconds 2025-05-07T20:30:53.6317326Z 2025-05-07T20:30:53.6317460Z 2025-05-07T20:30:53.6317466Z 2025-05-07T20:30:53.6317493Z 2025-05-07T20:30:53.6338027Z ################################################################################ 2025-05-07T20:30:53.6353845Z # [2025-05-07T20:30:53.635Z] Run Python Test Suite: 2025-05-07T20:30:53.6354224Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:53.6354600Z ################################################################################ 2025-05-07T20:30:53.6380319Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:53.6380928Z 2025-05-07T20:30:55.7972950Z ============================= test session starts ============================== 2025-05-07T20:30:55.7973657Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:55.7974210Z cachedir: .pytest_cache 2025-05-07T20:30:55.7974802Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:55.7975523Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:55.7975940Z plugins: hypothesis-6.131.14 2025-05-07T20:30:57.4715670Z collecting ... 
collected 1 item 2025-05-07T20:30:57.4716095Z 2025-05-07T20:30:58.2035159Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:58.2035506Z 2025-05-07T20:30:58.2035653Z ============================== 1 passed in 2.52s =============================== 2025-05-07T20:30:58.8898670Z 2025-05-07T20:30:58.8899261Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:58.8919804Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:58.8920138Z 2025-05-07T20:30:58.8920501Z 2025-05-07T20:30:58.8920506Z 2025-05-07T20:30:58.8920520Z 2025-05-07T20:30:58.8941206Z ################################################################################ 2025-05-07T20:30:58.8956474Z # [2025-05-07T20:30:58.895Z] Run Python Test Suite: 2025-05-07T20:30:58.8956802Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:58.8957098Z ################################################################################ 2025-05-07T20:30:58.8981191Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:58.8981818Z 2025-05-07T20:31:01.0619113Z ============================= test session starts ============================== 2025-05-07T20:31:01.0619738Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:01.0620269Z cachedir: .pytest_cache 2025-05-07T20:31:01.0620866Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:01.0621611Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:01.0622028Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.7706633Z collecting ... 
collected 5 items 2025-05-07T20:31:02.7706961Z 2025-05-07T20:31:02.7716276Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:02.7736706Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:02.7745486Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:02.7752217Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:02.7766620Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:02.7767093Z 2025-05-07T20:31:02.7767610Z =========================== short test summary info ============================ 2025-05-07T20:31:02.7768345Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7769276Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7770203Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7771128Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7772054Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:02.7772701Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:31:03.3837838Z 2025-05-07T20:31:03.3840585Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.3862476Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:03.3862890Z 2025-05-07T20:31:03.3862896Z 2025-05-07T20:31:03.3862902Z 2025-05-07T20:31:03.3862907Z 2025-05-07T20:31:03.3883548Z ################################################################################ 2025-05-07T20:31:03.3902190Z # [2025-05-07T20:31:03.389Z] Run Python Test Suite: 2025-05-07T20:31:03.3902679Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:03.3903122Z ################################################################################ 2025-05-07T20:31:03.3926795Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:03.3927554Z 2025-05-07T20:31:05.5462058Z ============================= test session starts ============================== 2025-05-07T20:31:05.5463198Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:05.5463710Z cachedir: .pytest_cache 2025-05-07T20:31:05.5464284Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:05.5465008Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:05.5465418Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.3892566Z collecting ... 
collected 2 items 2025-05-07T20:31:07.3893208Z 2025-05-07T20:31:07.3901556Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:07.3915511Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:07.3916120Z 2025-05-07T20:31:07.3916386Z =========================== short test summary info ============================ 2025-05-07T20:31:07.3917090Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:07.3917922Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:07.3918523Z ============================== 2 skipped in 1.96s ============================== 2025-05-07T20:31:08.0128909Z 2025-05-07T20:31:08.0129704Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.0148812Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:08.0149272Z 2025-05-07T20:31:08.0149277Z 2025-05-07T20:31:08.0149281Z 2025-05-07T20:31:08.0149285Z 2025-05-07T20:31:08.0171148Z ################################################################################ 2025-05-07T20:31:08.0186421Z # [2025-05-07T20:31:08.018Z] Run Python Test Suite: 2025-05-07T20:31:08.0187113Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:08.0187416Z ################################################################################ 2025-05-07T20:31:08.0211325Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:08.0211940Z 2025-05-07T20:31:10.1767531Z ============================= test session starts ============================== 2025-05-07T20:31:10.1768359Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.1768887Z cachedir: .pytest_cache 2025-05-07T20:31:10.1769464Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.1770199Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.1770604Z plugins: hypothesis-6.131.14 2025-05-07T20:31:11.9613015Z collecting ... collected 4 items 2025-05-07T20:31:11.9613349Z 2025-05-07T20:31:14.8073416Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
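[Editor's note] The skips above and immediately below are hardware gating rather than failures: multi_gpu_car_test needs at least two CUDA devices, gather_scatter_test runs only on Hopper, and the kv_cache tests want an H100/MI300 or xformers. A minimal sketch of this guard pattern using torch's device introspection follows; the exact decorators in the FBGEMM sources are not shown in this log, so treat it as illustrative:

import unittest

import torch

def gpu_count() -> int:
    # Number of visible CUDA devices, zero when CUDA is unavailable.
    return torch.cuda.device_count() if torch.cuda.is_available() else 0

def is_hopper() -> bool:
    # H100-class (Hopper) devices report compute capability (9, 0).
    return gpu_count() > 0 and torch.cuda.get_device_capability() == (9, 0)

class ExampleTests(unittest.TestCase):
    @unittest.skipIf(
        gpu_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    def test_needs_two_gpus(self) -> None:
        ...

    @unittest.skipIf(
        not is_hopper(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    def test_needs_hopper(self) -> None:
        ...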
2025-05-07T20:31:14.8199294Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:14.8345859Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:14.8471333Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:14.8471689Z 2025-05-07T20:31:14.8471847Z =========================== short test summary info ============================ 2025-05-07T20:31:14.8472538Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:14.8473463Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:14.8474099Z ============================== 4 skipped in 4.79s ============================== 2025-05-07T20:31:16.6951106Z 2025-05-07T20:31:16.6951856Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:16.6972132Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:16.6972475Z 2025-05-07T20:31:16.6972481Z 2025-05-07T20:31:16.6972486Z 2025-05-07T20:31:16.6972491Z 2025-05-07T20:31:16.6994392Z ################################################################################ 2025-05-07T20:31:16.7010190Z # [2025-05-07T20:31:16.700Z] Run Python Test Suite: 2025-05-07T20:31:16.7010638Z # ./moe/activation_test.py 2025-05-07T20:31:16.7010972Z ################################################################################ 2025-05-07T20:31:16.7034911Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:16.7035515Z 2025-05-07T20:31:18.8645616Z ============================= test session starts ============================== 2025-05-07T20:31:18.8646291Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:18.8646827Z cachedir: .pytest_cache 2025-05-07T20:31:18.8647414Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:18.8648142Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:18.8648561Z plugins: hypothesis-6.131.14 2025-05-07T20:31:20.5162861Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:20.6685356Z collecting ... 
collected 2 items 2025-05-07T20:31:20.6685577Z 2025-05-07T20:31:26.1336560Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:26.1337206Z self=, 2025-05-07T20:31:26.1338025Z T=1, 2025-05-07T20:31:26.1338234Z D=5120, 2025-05-07T20:31:26.1338670Z contiguous=True, 2025-05-07T20:31:26.1338905Z compiled=True, 2025-05-07T20:31:26.1339115Z ) 2025-05-07T20:31:26.1339319Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1339704Z self=, 2025-05-07T20:31:26.1340082Z T=4096, 2025-05-07T20:31:26.1340275Z D=5120, 2025-05-07T20:31:26.1340479Z contiguous=True, 2025-05-07T20:31:26.1340699Z compiled=True, 2025-05-07T20:31:26.1340911Z ) 2025-05-07T20:31:26.1341116Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1341491Z self=, 2025-05-07T20:31:26.1341872Z T=4096, 2025-05-07T20:31:26.1342068Z D=7168, 2025-05-07T20:31:26.1342261Z contiguous=False, 2025-05-07T20:31:26.1342493Z compiled=False, 2025-05-07T20:31:26.1342703Z ) 2025-05-07T20:31:26.1342894Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1343280Z self=, 2025-05-07T20:31:26.1343664Z T=4096, 2025-05-07T20:31:26.1343856Z D=5120, 2025-05-07T20:31:26.1344047Z contiguous=False, 2025-05-07T20:31:26.1344276Z compiled=True, 2025-05-07T20:31:26.1344487Z ) 2025-05-07T20:31:26.1344678Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1345056Z self=, 2025-05-07T20:31:26.1345440Z T=1, 2025-05-07T20:31:26.1345623Z D=7168, 2025-05-07T20:31:26.1345821Z contiguous=True, 2025-05-07T20:31:26.1346053Z compiled=True, 2025-05-07T20:31:26.1347716Z ) 2025-05-07T20:31:26.1347918Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1348293Z self=, 2025-05-07T20:31:26.1348669Z T=1, 2025-05-07T20:31:26.1348857Z D=7168, 2025-05-07T20:31:26.1349058Z contiguous=False, 2025-05-07T20:31:26.1349281Z compiled=True, 2025-05-07T20:31:26.1349492Z ) 2025-05-07T20:31:26.1349871Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1350244Z self=, 2025-05-07T20:31:26.1350628Z T=4096, 2025-05-07T20:31:26.1350822Z D=5120, 2025-05-07T20:31:26.1351012Z contiguous=False, 2025-05-07T20:31:26.1351245Z compiled=False, 2025-05-07T20:31:26.1351456Z ) 2025-05-07T20:31:26.1351646Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1352027Z self=, 2025-05-07T20:31:26.1352450Z T=1, 2025-05-07T20:31:26.1352641Z D=7168, 2025-05-07T20:31:26.1352831Z contiguous=True, 2025-05-07T20:31:26.1353056Z compiled=False, 2025-05-07T20:31:26.1353262Z ) 2025-05-07T20:31:26.1353454Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1353831Z self=, 2025-05-07T20:31:26.1354210Z T=2048, 2025-05-07T20:31:26.1354394Z D=5120, 2025-05-07T20:31:26.1354597Z contiguous=True, 2025-05-07T20:31:26.1354829Z compiled=True, 2025-05-07T20:31:26.1355029Z ) 2025-05-07T20:31:26.1355227Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1355601Z self=, 2025-05-07T20:31:26.1355976Z T=2048, 2025-05-07T20:31:26.1356167Z D=7168, 2025-05-07T20:31:26.1356363Z contiguous=True, 2025-05-07T20:31:26.1356580Z compiled=True, 2025-05-07T20:31:26.1356789Z ) 2025-05-07T20:31:26.1356991Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1357364Z self=, 2025-05-07T20:31:26.1357744Z T=2048, 2025-05-07T20:31:26.1357926Z D=7168, 2025-05-07T20:31:26.1358123Z contiguous=True, 2025-05-07T20:31:26.1358345Z compiled=False, 2025-05-07T20:31:26.1358553Z ) 2025-05-07T20:31:26.1358750Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1359122Z self=, 2025-05-07T20:31:26.1359643Z T=128, 2025-05-07T20:31:26.1359834Z D=5120, 2025-05-07T20:31:26.1360030Z contiguous=False, 2025-05-07T20:31:26.1360265Z 
compiled=True, 2025-05-07T20:31:26.1360475Z ) 2025-05-07T20:31:26.1360673Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1361054Z self=, 2025-05-07T20:31:26.1361439Z T=128, 2025-05-07T20:31:26.1361626Z D=5120, 2025-05-07T20:31:26.1361826Z contiguous=True, 2025-05-07T20:31:26.1362053Z compiled=True, 2025-05-07T20:31:26.1362257Z ) 2025-05-07T20:31:26.1362461Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1362839Z self=, 2025-05-07T20:31:26.1363218Z T=16384, 2025-05-07T20:31:26.1363591Z D=5120, 2025-05-07T20:31:26.1363792Z contiguous=False, 2025-05-07T20:31:26.1364016Z compiled=True, 2025-05-07T20:31:26.1364225Z ) 2025-05-07T20:31:26.1364425Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1364809Z self=, 2025-05-07T20:31:26.1365192Z T=16384, 2025-05-07T20:31:26.1365391Z D=5120, 2025-05-07T20:31:26.1365591Z contiguous=False, 2025-05-07T20:31:26.1365816Z compiled=False, 2025-05-07T20:31:26.1366021Z ) 2025-05-07T20:31:26.1366220Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1366593Z self=, 2025-05-07T20:31:26.1366978Z T=128, 2025-05-07T20:31:26.1367168Z D=7168, 2025-05-07T20:31:26.1367372Z contiguous=True, 2025-05-07T20:31:26.1367598Z compiled=False, 2025-05-07T20:31:26.1367807Z ) 2025-05-07T20:31:26.1368006Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1368376Z self=, 2025-05-07T20:31:26.1368765Z T=128, 2025-05-07T20:31:26.1368958Z D=7168, 2025-05-07T20:31:26.1369153Z contiguous=False, 2025-05-07T20:31:26.1369383Z compiled=False, 2025-05-07T20:31:26.1369691Z ) 2025-05-07T20:31:26.1369883Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1370262Z self=, 2025-05-07T20:31:26.1370647Z T=1, 2025-05-07T20:31:26.1370835Z D=5120, 2025-05-07T20:31:26.1371029Z contiguous=False, 2025-05-07T20:31:26.1371257Z compiled=False, 2025-05-07T20:31:26.1371469Z ) 2025-05-07T20:31:26.1371660Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1372033Z self=, 2025-05-07T20:31:26.1372443Z T=1, 2025-05-07T20:31:26.1372646Z D=7168, 2025-05-07T20:31:26.1372842Z contiguous=False, 2025-05-07T20:31:26.1373070Z compiled=False, 2025-05-07T20:31:26.1373278Z ) 2025-05-07T20:31:26.1373478Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1373855Z self=, 2025-05-07T20:31:26.1374231Z T=4096, 2025-05-07T20:31:26.1374420Z D=5120, 2025-05-07T20:31:26.1374621Z contiguous=True, 2025-05-07T20:31:26.1374847Z compiled=False, 2025-05-07T20:31:26.1375051Z ) 2025-05-07T20:31:26.1375248Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1375617Z self=, 2025-05-07T20:31:26.1376000Z T=128, 2025-05-07T20:31:26.1376187Z D=7168, 2025-05-07T20:31:26.1376390Z contiguous=True, 2025-05-07T20:31:26.1376612Z compiled=True, 2025-05-07T20:31:26.1376820Z ) 2025-05-07T20:31:26.1377018Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1377389Z self=, 2025-05-07T20:31:26.1377773Z T=1, 2025-05-07T20:31:26.1377959Z D=5120, 2025-05-07T20:31:26.1378151Z contiguous=False, 2025-05-07T20:31:26.1378381Z compiled=True, 2025-05-07T20:31:26.1378588Z ) 2025-05-07T20:31:26.1378785Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1379159Z self=, 2025-05-07T20:31:26.1379646Z T=4096, 2025-05-07T20:31:26.1379833Z D=7168, 2025-05-07T20:31:26.1380032Z contiguous=True, 2025-05-07T20:31:26.1380260Z compiled=False, 2025-05-07T20:31:26.1380466Z ) 2025-05-07T20:31:26.1380665Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1381046Z self=, 2025-05-07T20:31:26.1381422Z T=4096, 2025-05-07T20:31:26.1381612Z D=7168, 2025-05-07T20:31:26.1381811Z contiguous=False, 2025-05-07T20:31:26.1382034Z compiled=True, 2025-05-07T20:31:26.1382238Z ) 
2025-05-07T20:31:26.1382437Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1382840Z self=, 2025-05-07T20:31:26.1383238Z T=128, 2025-05-07T20:31:26.1383429Z D=5120, 2025-05-07T20:31:26.1383624Z contiguous=True, 2025-05-07T20:31:26.1383841Z compiled=False, 2025-05-07T20:31:26.1384049Z ) 2025-05-07T20:31:26.1384248Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1384625Z self=, 2025-05-07T20:31:26.1385010Z T=128, 2025-05-07T20:31:26.1385202Z D=5120, 2025-05-07T20:31:26.1385390Z contiguous=False, 2025-05-07T20:31:26.1385621Z compiled=False, 2025-05-07T20:31:26.1385829Z ) 2025-05-07T20:31:26.1386020Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1386395Z self=, 2025-05-07T20:31:26.1386779Z T=1, 2025-05-07T20:31:26.1386958Z D=5120, 2025-05-07T20:31:26.1387159Z contiguous=True, 2025-05-07T20:31:26.1387385Z compiled=False, 2025-05-07T20:31:26.1387582Z ) 2025-05-07T20:31:26.1387780Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1388159Z self=, 2025-05-07T20:31:26.1388537Z T=2048, 2025-05-07T20:31:26.1388728Z D=7168, 2025-05-07T20:31:26.1388943Z contiguous=False, 2025-05-07T20:31:26.1389172Z compiled=True, 2025-05-07T20:31:26.1389464Z ) 2025-05-07T20:31:26.1389669Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1390044Z self=, 2025-05-07T20:31:26.1390429Z T=2048, 2025-05-07T20:31:26.1390613Z D=7168, 2025-05-07T20:31:26.1390811Z contiguous=False, 2025-05-07T20:31:26.1391044Z compiled=False, 2025-05-07T20:31:26.1391248Z ) 2025-05-07T20:31:26.1391451Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1391830Z self=, 2025-05-07T20:31:26.1392207Z T=16384, 2025-05-07T20:31:26.1392409Z D=7168, 2025-05-07T20:31:26.1392609Z contiguous=False, 2025-05-07T20:31:26.1392835Z compiled=True, 2025-05-07T20:31:26.1393043Z ) 2025-05-07T20:31:26.1393247Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1393616Z self=, 2025-05-07T20:31:26.1394004Z T=16384, 2025-05-07T20:31:26.1394199Z D=7168, 2025-05-07T20:31:26.1394402Z contiguous=True, 2025-05-07T20:31:26.1394628Z compiled=True, 2025-05-07T20:31:26.1394834Z ) 2025-05-07T20:31:26.1395024Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1395400Z self=, 2025-05-07T20:31:26.1395782Z T=4096, 2025-05-07T20:31:26.1395972Z D=7168, 2025-05-07T20:31:26.1396176Z contiguous=True, 2025-05-07T20:31:26.1396410Z compiled=True, 2025-05-07T20:31:26.1396610Z ) 2025-05-07T20:31:26.1396813Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1397189Z self=, 2025-05-07T20:31:26.1397564Z T=2048, 2025-05-07T20:31:26.1397757Z D=5120, 2025-05-07T20:31:26.1397965Z contiguous=False, 2025-05-07T20:31:26.1398187Z compiled=False, 2025-05-07T20:31:26.1398401Z ) 2025-05-07T20:31:26.1398602Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1399068Z self=, 2025-05-07T20:31:26.1399796Z T=2048, 2025-05-07T20:31:26.1400055Z D=5120, 2025-05-07T20:31:26.1409239Z contiguous=True, 2025-05-07T20:31:26.1409483Z compiled=False, 2025-05-07T20:31:26.1409689Z ) 2025-05-07T20:31:26.1409892Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1410277Z self=, 2025-05-07T20:31:26.1410663Z T=128, 2025-05-07T20:31:26.1410844Z D=7168, 2025-05-07T20:31:26.1411045Z contiguous=False, 2025-05-07T20:31:26.1411276Z compiled=True, 2025-05-07T20:31:26.1411476Z ) 2025-05-07T20:31:26.1411678Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1412059Z self=, 2025-05-07T20:31:26.1412432Z T=16384, 2025-05-07T20:31:26.1412631Z D=5120, 2025-05-07T20:31:26.1412830Z contiguous=True, 2025-05-07T20:31:26.1413044Z compiled=True, 2025-05-07T20:31:26.1413251Z ) 2025-05-07T20:31:26.1413460Z Trying example: 
test_silu_mul( 2025-05-07T20:31:26.1413831Z self=, 2025-05-07T20:31:26.1414211Z T=2048, 2025-05-07T20:31:26.1414399Z D=5120, 2025-05-07T20:31:26.1414591Z contiguous=False, 2025-05-07T20:31:26.1414816Z compiled=True, 2025-05-07T20:31:26.1415019Z ) 2025-05-07T20:31:26.1415213Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1415585Z self=, 2025-05-07T20:31:26.1415968Z T=16384, 2025-05-07T20:31:26.1416155Z D=5120, 2025-05-07T20:31:26.1416354Z contiguous=True, 2025-05-07T20:31:26.1416588Z compiled=False, 2025-05-07T20:31:26.1416805Z ) 2025-05-07T20:31:26.1416996Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1417375Z self=, 2025-05-07T20:31:26.1417763Z T=16384, 2025-05-07T20:31:26.1417951Z D=7168, 2025-05-07T20:31:26.1418148Z contiguous=False, 2025-05-07T20:31:26.1418379Z compiled=False, 2025-05-07T20:31:26.1418704Z ) 2025-05-07T20:31:26.1418903Z Trying example: test_silu_mul( 2025-05-07T20:31:26.1419282Z self=, 2025-05-07T20:31:26.1419652Z T=16384, 2025-05-07T20:31:26.1419844Z D=7168, 2025-05-07T20:31:26.1420043Z contiguous=True, 2025-05-07T20:31:26.1420266Z compiled=False, 2025-05-07T20:31:26.1420484Z ) 2025-05-07T20:31:26.1420678Z PASSED 2025-05-07T20:31:26.2038905Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.2039994Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:26.2041361Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.2043201Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.2044360Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2045673Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.2047052Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.2048393Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2049647Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.2051033Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.2052099Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:31:26.2053380Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.2054642Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:26.2055875Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.2057088Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:26.2057920Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:26.2058947Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:26.2060127Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:26.2060927Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:26.2062141Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.2063595Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.2064734Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:26.2065790Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:26.2066974Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.2068336Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.2069394Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.2070315Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.2071168Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:26.2072211Z W0507 20:31:26.202000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:26.6790521Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:26.6792582Z self=,
2025-05-07T20:31:26.6793211Z T=1,
2025-05-07T20:31:26.6793421Z D=5120,
2025-05-07T20:31:26.6793620Z scale_ub=None,
2025-05-07T20:31:26.6793834Z contiguous=True,
2025-05-07T20:31:26.6794061Z compiled=True,
2025-05-07T20:31:26.6794266Z )
2025-05-07T20:31:26.6794589Z self = 
2025-05-07T20:31:26.6795075Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:26.6795335Z 
2025-05-07T20:31:26.6795416Z @given(
2025-05-07T20:31:26.6795648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:26.6795964Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:26.6796262Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:26.6796590Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:26.6796923Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:26.6797210Z )
2025-05-07T20:31:26.6797561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:26.6798005Z def test_silu_mul_quant(
2025-05-07T20:31:26.6798269Z self,
2025-05-07T20:31:26.6798469Z T: int,
2025-05-07T20:31:26.6798663Z D: int,
2025-05-07T20:31:26.6798883Z scale_ub: Optional[float],
2025-05-07T20:31:26.6799158Z contiguous: bool,
2025-05-07T20:31:26.6799391Z compiled: bool,
2025-05-07T20:31:26.6799621Z ) -> None:
2025-05-07T20:31:26.6799844Z torch.manual_seed(2025)
2025-05-07T20:31:26.6800082Z 
2025-05-07T20:31:26.6800361Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:26.6800709Z 
2025-05-07T20:31:26.6800896Z x_sign = torch.sign(x)
2025-05-07T20:31:26.6801191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:26.6801510Z x = x_sign * x_clamp
2025-05-07T20:31:26.6801914Z x0 = x[:, :D]
2025-05-07T20:31:26.6802143Z x1 = x[:, D:]
2025-05-07T20:31:26.6802358Z 
2025-05-07T20:31:26.6802541Z if contiguous:
2025-05-07T20:31:26.6802784Z x0 = x0.contiguous()
2025-05-07T20:31:26.6803044Z x1 = x1.contiguous()
2025-05-07T20:31:26.6803281Z 
2025-05-07T20:31:26.6803679Z if scale_ub is not None:
2025-05-07T20:31:26.6803956Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:26.6804295Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:26.6804603Z )
2025-05-07T20:31:26.6804799Z else:
2025-05-07T20:31:26.6805011Z scale_ub_tensor = None
2025-05-07T20:31:26.6805258Z 
2025-05-07T20:31:26.6805497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:26.6805814Z op = silu_mul_quant
2025-05-07T20:31:26.6806056Z if compiled:
2025-05-07T20:31:26.6806300Z op = torch.compile(op)
2025-05-07T20:31:26.6806604Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:26.6806874Z 
2025-05-07T20:31:26.6807069Z y_fp8, y_scale = fn()
2025-05-07T20:31:26.6807355Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:26.6807639Z 
2025-05-07T20:31:26.6807875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:26.6808210Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:26.6808503Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:26.6808815Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:26.6809172Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:26.6809480Z 
2025-05-07T20:31:26.6809675Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:26.6809874Z 
2025-05-07T20:31:26.6809976Z moe/activation_test.py:126:
2025-05-07T20:31:26.6810272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:26.6810607Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:26.6811029Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:26.6811826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:26.6812586Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:26.6813133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:26.6813829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:26.6814527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:26.6815265Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:26.6816019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:26.6816785Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:26.6817521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:26.6818164Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:26.6818768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:26.6819289Z fn()
2025-05-07T20:31:26.6819805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:26.6820384Z self.fn.run(
2025-05-07T20:31:26.6820858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:26.6821391Z kernel = self.compile(
2025-05-07T20:31:26.6822011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:26.6822680Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:26.6823080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:26.6823312Z 
2025-05-07T20:31:26.6823525Z self = 
2025-05-07T20:31:26.6824598Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:26.6825981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369c3060>}
2025-05-07T20:31:26.6827329Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:26.6828361Z context = 
2025-05-07T20:31:26.6828647Z 
2025-05-07T20:31:26.6828822Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:26.6829343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:26.6829814Z module_map=module_map)
2025-05-07T20:31:26.6830185Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:26.6830538Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:26.6830809Z E ^
2025-05-07T20:31:26.6831274Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:26.6831726Z 
2025-05-07T20:31:26.6832158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:26.6832759Z 
2025-05-07T20:31:26.6832863Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:26.6833280Z self=,
2025-05-07T20:31:26.6833686Z T=2048,
2025-05-07T20:31:26.6833869Z D=5120,
2025-05-07T20:31:26.6834062Z scale_ub=1200.0,
2025-05-07T20:31:26.6834288Z contiguous=True,
2025-05-07T20:31:26.6834511Z compiled=False,
2025-05-07T20:31:26.6834714Z )
2025-05-07T20:31:27.0255545Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:27.0256624Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last):
2025-05-07T20:31:27.0257982Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:27.0259437Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:27.0260414Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0261713Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:27.0263095Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.0264374Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0265615Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:27.0266996Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.0268053Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0269337Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:27.0270585Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse())
2025-05-07T20:31:27.0271803Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:27.0273016Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node)
2025-05-07T20:31:27.0273839Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.0274865Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:27.0276032Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node)
2025-05-07T20:31:27.0276829Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^
2025-05-07T20:31:27.0278039Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:27.0279322Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:27.0280440Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:27.0281499Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item)
2025-05-07T20:31:27.0282681Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:27.0284204Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:27.0285262Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.0286171Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.0287028Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^
2025-05-07T20:31:27.0288065Z W0507 20:31:27.022000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7225644Z self = 
2025-05-07T20:31:27.7226851Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:27.7229342Z 
2025-05-07T20:31:27.7229617Z @given(
2025-05-07T20:31:27.7230027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.7230574Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.7231042Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.7231417Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.7231816Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.7232240Z )
2025-05-07T20:31:27.7232760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.7233460Z def test_silu_mul_quant(
2025-05-07T20:31:27.7233771Z self,
2025-05-07T20:31:27.7233975Z T: int,
2025-05-07T20:31:27.7234173Z D: int,
2025-05-07T20:31:27.7234499Z scale_ub: Optional[float],
2025-05-07T20:31:27.7234787Z contiguous: bool,
2025-05-07T20:31:27.7235049Z compiled: bool,
2025-05-07T20:31:27.7235374Z ) -> None:
2025-05-07T20:31:27.7235605Z torch.manual_seed(2025)
2025-05-07T20:31:27.7235851Z 
2025-05-07T20:31:27.7236140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.7236507Z 
2025-05-07T20:31:27.7236709Z x_sign = torch.sign(x)
2025-05-07T20:31:27.7237015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.7237342Z x = x_sign * x_clamp
2025-05-07T20:31:27.7237584Z x0 = x[:, :D]
2025-05-07T20:31:27.7237821Z x1 = x[:, D:]
2025-05-07T20:31:27.7238045Z 
2025-05-07T20:31:27.7238239Z if contiguous:
2025-05-07T20:31:27.7238837Z x0 = x0.contiguous()
2025-05-07T20:31:27.7239110Z x1 = x1.contiguous()
2025-05-07T20:31:27.7239362Z 
2025-05-07T20:31:27.7239570Z if scale_ub is not None:
2025-05-07T20:31:27.7239839Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.7240426Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.7240761Z )
2025-05-07T20:31:27.7240957Z else:
2025-05-07T20:31:27.7241180Z scale_ub_tensor = None
2025-05-07T20:31:27.7241446Z 
2025-05-07T20:31:27.7241689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7242005Z op = silu_mul_quant
2025-05-07T20:31:27.7242263Z if compiled:
2025-05-07T20:31:27.7242518Z op = torch.compile(op)
2025-05-07T20:31:27.7242813Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7243101Z 
2025-05-07T20:31:27.7243300Z > y_fp8, y_scale = fn()
2025-05-07T20:31:27.7243646Z 
2025-05-07T20:31:27.7243748Z moe/activation_test.py:117:
2025-05-07T20:31:27.7244056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7244402Z moe/activation_test.py:115: in fn
2025-05-07T20:31:27.7244685Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7245400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:27.7246100Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:27.7246651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.7247342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.7248021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.7248571Z kernel = self.compile(
2025-05-07T20:31:27.7249126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.7249783Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.7250195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7250556Z 
2025-05-07T20:31:27.7250779Z self = 
2025-05-07T20:31:27.7251854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.7253248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369deac0>}
2025-05-07T20:31:27.7254593Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.7255629Z context = 
2025-05-07T20:31:27.7255917Z 
2025-05-07T20:31:27.7256098Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.7256627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.7257102Z module_map=module_map)
2025-05-07T20:31:27.7257475Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.7257824Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.7258094Z E ^
2025-05-07T20:31:27.7258563Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7259017Z 
2025-05-07T20:31:27.7259445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:27.7259962Z 
2025-05-07T20:31:27.7260067Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.7260486Z self=,
2025-05-07T20:31:27.7260978Z T=2048,
2025-05-07T20:31:27.7261179Z D=5120,
2025-05-07T20:31:27.7261377Z scale_ub=1200.0,
2025-05-07T20:31:27.7261613Z contiguous=True,
2025-05-07T20:31:27.7261836Z compiled=True,
2025-05-07T20:31:27.7262043Z )
2025-05-07T20:31:27.7262373Z self = 
2025-05-07T20:31:27.7262900Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:27.7263199Z 
2025-05-07T20:31:27.7263279Z @given(
2025-05-07T20:31:27.7263519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.7263842Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.7264146Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.7264483Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.7264816Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.7265111Z )
2025-05-07T20:31:27.7265467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.7265925Z def test_silu_mul_quant(
2025-05-07T20:31:27.7266176Z self,
2025-05-07T20:31:27.7266370Z T: int,
2025-05-07T20:31:27.7266576Z D: int,
2025-05-07T20:31:27.7266801Z scale_ub: Optional[float],
2025-05-07T20:31:27.7267072Z contiguous: bool,
2025-05-07T20:31:27.7267320Z compiled: bool,
2025-05-07T20:31:27.7267553Z ) -> None:
2025-05-07T20:31:27.7267771Z torch.manual_seed(2025)
2025-05-07T20:31:27.7268021Z 
2025-05-07T20:31:27.7268303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.7268646Z 
2025-05-07T20:31:27.7268854Z x_sign = torch.sign(x)
2025-05-07T20:31:27.7269155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.7269466Z x = x_sign * x_clamp
2025-05-07T20:31:27.7269714Z x0 = x[:, :D]
2025-05-07T20:31:27.7269935Z x1 = x[:, D:]
2025-05-07T20:31:27.7270142Z 
2025-05-07T20:31:27.7270341Z if contiguous:
2025-05-07T20:31:27.7270664Z x0 = x0.contiguous()
2025-05-07T20:31:27.7270928Z x1 = x1.contiguous()
2025-05-07T20:31:27.7271166Z 
2025-05-07T20:31:27.7271363Z if scale_ub is not None:
2025-05-07T20:31:27.7271642Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.7271976Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.7272286Z )
2025-05-07T20:31:27.7272484Z else:
2025-05-07T20:31:27.7272690Z scale_ub_tensor = None
2025-05-07T20:31:27.7272971Z 
2025-05-07T20:31:27.7273240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7273552Z op = silu_mul_quant
2025-05-07T20:31:27.7273812Z if compiled:
2025-05-07T20:31:27.7274067Z op = torch.compile(op)
2025-05-07T20:31:27.7274364Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.7274642Z 
2025-05-07T20:31:27.7274844Z y_fp8, y_scale = fn()
2025-05-07T20:31:27.7275136Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:27.7275433Z 
2025-05-07T20:31:27.7275676Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.7276011Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:27.7276304Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:27.7276630Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:27.7276991Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.7277299Z 
2025-05-07T20:31:27.7277506Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:27.7277699Z 
2025-05-07T20:31:27.7277806Z moe/activation_test.py:126:
2025-05-07T20:31:27.7278101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7278448Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:27.7278783Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.7279814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:27.7280601Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:27.7281159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.7281855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.7282560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:27.7283296Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.7284213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:27.7284977Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.7285723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:27.7286380Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:27.7286995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:27.7287531Z fn()
2025-05-07T20:31:27.7288048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:27.7288644Z self.fn.run(
2025-05-07T20:31:27.7289131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.7289669Z kernel = self.compile(
2025-05-07T20:31:27.7290265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.7291201Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.7291768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:27.7292016Z 
2025-05-07T20:31:27.7292228Z self = 
2025-05-07T20:31:27.7293316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.7294687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f093c5387c0>}
2025-05-07T20:31:27.7296036Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.7297072Z context = 
2025-05-07T20:31:27.7297370Z 
2025-05-07T20:31:27.7297541Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.7298070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.7298544Z module_map=module_map)
2025-05-07T20:31:27.7298920Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.7299280Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:27.7299554Z E ^
2025-05-07T20:31:27.7300026Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.7300481Z 
2025-05-07T20:31:27.7300905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:27.7301429Z 
2025-05-07T20:31:27.7301536Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.7302077Z self=,
2025-05-07T20:31:27.7302485Z T=16384,
2025-05-07T20:31:27.7302681Z D=7168,
2025-05-07T20:31:27.7302881Z scale_ub=1200.0,
2025-05-07T20:31:27.7303109Z contiguous=False,
2025-05-07T20:31:27.7303334Z compiled=False,
2025-05-07T20:31:27.7303542Z )
2025-05-07T20:31:27.9702724Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:27.9703867Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
2025-05-07T20:31:27.9705227Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:27.9706678Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:27.9707660Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9708971Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:27.9710363Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.9711351Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9712858Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:27.9714243Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.9715308Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9716592Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:27.9717850Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse())
2025-05-07T20:31:27.9719067Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:27.9720276Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node)
2025-05-07T20:31:27.9721115Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:27.9722148Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:27.9723355Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node)
2025-05-07T20:31:27.9724260Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^
2025-05-07T20:31:27.9725474Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:27.9726757Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:27.9727876Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:27.9728925Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item)
2025-05-07T20:31:27.9730112Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.9731473Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.9732535Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.9733500Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.9734240Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:27.9735267Z W0507 20:31:27.967000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.0417146Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:28.0418201Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:28.0419539Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:28.0420976Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:28.0421962Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0423302Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:28.0424713Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.0425696Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0427167Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:28.0428571Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.0429642Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0430935Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:28.0432202Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:28.0433436Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:28.0434648Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:28.0435482Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.0436514Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:28.0437542Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:28.0438343Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:28.0439986Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:28.0441282Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:28.0442412Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:28.0443688Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:28.0444879Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:28.0446257Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:28.0447324Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.0448243Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.0448983Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:28.0450132Z W0507 20:31:28.039000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.4577300Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:28.4578380Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:28.4579721Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:28.4581156Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:28.4582154Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4583483Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:28.4584863Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.4585849Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4587088Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:28.4588750Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.4589814Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4591086Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:28.4592346Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:28.4593575Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:28.4594802Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:28.4595640Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:28.4596663Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:28.4597693Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:28.4598498Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:28.4599864Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:28.4601161Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:28.4602279Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:28.4603329Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:28.4604680Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:28.4606041Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:28.4607100Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.4608023Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.4608774Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:28.4609796Z W0507 20:31:28.455000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:29.2729158Z self = 2025-05-07T20:31:29.2729922Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:29.2730692Z 2025-05-07T20:31:29.2730807Z @given( 2025-05-07T20:31:29.2731046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.2731376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.2731687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.2732027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.2732355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.2732646Z ) 2025-05-07T20:31:29.2733006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.2733453Z def test_silu_mul_quant( 2025-05-07T20:31:29.2733705Z self, 2025-05-07T20:31:29.2733988Z T: int, 2025-05-07T20:31:29.2742417Z D: int, 2025-05-07T20:31:29.2742658Z scale_ub: Optional[float], 2025-05-07T20:31:29.2742927Z contiguous: bool, 2025-05-07T20:31:29.2743179Z compiled: bool, 2025-05-07T20:31:29.2743449Z ) -> None: 2025-05-07T20:31:29.2743679Z torch.manual_seed(2025) 2025-05-07T20:31:29.2743931Z 2025-05-07T20:31:29.2744210Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.2744549Z 2025-05-07T20:31:29.2744746Z x_sign = torch.sign(x) 2025-05-07T20:31:29.2745041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.2745356Z x = x_sign * x_clamp 2025-05-07T20:31:29.2745591Z x0 = x[:, :D] 2025-05-07T20:31:29.2745810Z x1 = x[:, D:] 2025-05-07T20:31:29.2746020Z 2025-05-07T20:31:29.2746199Z if contiguous: 2025-05-07T20:31:29.2746431Z x0 = x0.contiguous() 2025-05-07T20:31:29.2746690Z x1 = x1.contiguous() 2025-05-07T20:31:29.2746920Z 2025-05-07T20:31:29.2747115Z if scale_ub is not None: 2025-05-07T20:31:29.2747391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.2747724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.2748030Z ) 2025-05-07T20:31:29.2748450Z else: 2025-05-07T20:31:29.2748656Z scale_ub_tensor = None 2025-05-07T20:31:29.2748910Z 2025-05-07T20:31:29.2749145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2749451Z op = silu_mul_quant 2025-05-07T20:31:29.2749700Z if compiled: 2025-05-07T20:31:29.2749947Z op = torch.compile(op) 2025-05-07T20:31:29.2750246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2750514Z 2025-05-07T20:31:29.2750706Z > y_fp8, y_scale = fn() 2025-05-07T20:31:29.2750870Z 2025-05-07T20:31:29.2750977Z moe/activation_test.py:117: 2025-05-07T20:31:29.2751266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2751598Z moe/activation_test.py:115: in fn 2025-05-07T20:31:29.2751876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2752562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:29.2753258Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:29.2753799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.2754486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.2755146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.2755681Z kernel = self.compile( 2025-05-07T20:31:29.2756231Z
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.2756891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.2757284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2757518Z 2025-05-07T20:31:29.2757852Z self = 2025-05-07T20:31:29.2758940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.2760366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369ddc60>} 2025-05-07T20:31:29.2761704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.2762723Z context = 2025-05-07T20:31:29.2763023Z 2025-05-07T20:31:29.2763192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.2763831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.2764306Z module_map=module_map) 2025-05-07T20:31:29.2764665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.2765016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.2765277Z E ^ 2025-05-07T20:31:29.2765734Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.2766191Z 2025-05-07T20:31:29.2766609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.2767133Z 2025-05-07T20:31:29.2767235Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.2767648Z self=, 2025-05-07T20:31:29.2768041Z T=1, 2025-05-07T20:31:29.2768227Z D=7168, 2025-05-07T20:31:29.2768422Z scale_ub=None, 2025-05-07T20:31:29.2768633Z contiguous=True, 2025-05-07T20:31:29.2768978Z compiled=True, 2025-05-07T20:31:29.2769183Z ) 2025-05-07T20:31:29.2769494Z self = 2025-05-07T20:31:29.2769977Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.2770235Z 2025-05-07T20:31:29.2770320Z @given( 2025-05-07T20:31:29.2770547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.2770860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.2771170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.2771506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.2771831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.2772118Z ) 2025-05-07T20:31:29.2772461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.2772903Z def test_silu_mul_quant( 2025-05-07T20:31:29.2773140Z self, 2025-05-07T20:31:29.2773339Z T: int, 2025-05-07T20:31:29.2773535Z D: int, 2025-05-07T20:31:29.2773743Z scale_ub: Optional[float], 2025-05-07T20:31:29.2774010Z contiguous: bool, 2025-05-07T20:31:29.2774246Z compiled: bool, 2025-05-07T20:31:29.2774459Z ) -> None: 2025-05-07T20:31:29.2774674Z torch.manual_seed(2025) 2025-05-07T20:31:29.2774913Z 2025-05-07T20:31:29.2775175Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.2775516Z 2025-05-07T20:31:29.2775704Z x_sign = torch.sign(x) 2025-05-07T20:31:29.2775986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.2776294Z x = x_sign * x_clamp 2025-05-07T20:31:29.2776531Z x0 = x[:, :D] 2025-05-07T20:31:29.2776734Z x1 = 
x[:, D:] 2025-05-07T20:31:29.2776943Z 2025-05-07T20:31:29.2777128Z if contiguous: 2025-05-07T20:31:29.2777352Z x0 = x0.contiguous() 2025-05-07T20:31:29.2777685Z x1 = x1.contiguous() 2025-05-07T20:31:29.2777927Z 2025-05-07T20:31:29.2778115Z if scale_ub is not None: 2025-05-07T20:31:29.2778378Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.2778706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.2779010Z ) 2025-05-07T20:31:29.2779196Z else: 2025-05-07T20:31:29.2779400Z scale_ub_tensor = None 2025-05-07T20:31:29.2779650Z 2025-05-07T20:31:29.2779873Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2780183Z op = silu_mul_quant 2025-05-07T20:31:29.2780429Z if compiled: 2025-05-07T20:31:29.2780670Z op = torch.compile(op) 2025-05-07T20:31:29.2780966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.2781237Z 2025-05-07T20:31:29.2781420Z y_fp8, y_scale = fn() 2025-05-07T20:31:29.2781700Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:29.2781993Z 2025-05-07T20:31:29.2782229Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.2782553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:29.2782840Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:29.2783151Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:29.2783502Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.2783808Z 2025-05-07T20:31:29.2784005Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:29.2784196Z 2025-05-07T20:31:29.2784293Z moe/activation_test.py:126: 2025-05-07T20:31:29.2784587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2784918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:29.2785247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.2786033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:29.2786875Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:29.2787423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.2788097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.2788788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:29.2789512Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.2790268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:29.2791012Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.2791749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:29.2792395Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:29.2792999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:29.2793512Z fn() 2025-05-07T20:31:29.2794052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:29.2794652Z self.fn.run( 2025-05-07T20:31:29.2795112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 
2025-05-07T20:31:29.2795640Z kernel = self.compile( 2025-05-07T20:31:29.2796184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.2796839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.2797226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.2797546Z 2025-05-07T20:31:29.2797754Z self = 2025-05-07T20:31:29.2798828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.2800192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e0360>} 2025-05-07T20:31:29.2801523Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.2802545Z context = 2025-05-07T20:31:29.2802834Z 2025-05-07T20:31:29.2803006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.2803626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.2804086Z module_map=module_map) 2025-05-07T20:31:29.2804451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.2804810Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:29.2805076Z E ^ 2025-05-07T20:31:29.2805532Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.2806017Z 2025-05-07T20:31:29.2806558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.2807082Z 2025-05-07T20:31:29.2807192Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.2807679Z self=, 2025-05-07T20:31:29.2808089Z T=4096, 2025-05-07T20:31:29.2808375Z D=5120, 2025-05-07T20:31:29.2808565Z scale_ub=None, 2025-05-07T20:31:29.2808771Z contiguous=False, 2025-05-07T20:31:29.2808992Z compiled=False, 2025-05-07T20:31:29.2809192Z )
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:29.8756093Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:29.8757315Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:29.8758524Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:29.8759340Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:29.8760364Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:29.8761382Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:29.8762174Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:29.8763639Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:29.8764933Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:29.8766047Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:29.8767092Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:29.8768275Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:29.8769628Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:29.8770685Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.8771595Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.8772333Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:29.8773346Z W0507 20:31:29.871000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:30.2407175Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.2408627Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.2409963Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.2411400Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.2412375Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2413676Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.2415066Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.2416045Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2417276Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.2418823Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.2419886Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2421167Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.2422416Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:30.2423641Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.2424902Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.2425728Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2426754Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:30.2427775Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.2428562Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:30.2429770Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.2431580Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.2432693Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:30.2433742Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.2434918Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.2436277Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.2437346Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.2438256Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.2439299Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.2440316Z W0507 20:31:30.238000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:30.2549559Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.2550782Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.2552123Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.2553543Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.2554561Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2555870Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.2557252Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.2558234Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2559463Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.2560831Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.2561898Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2563303Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.2564682Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:30.2565900Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.2567106Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.2567945Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:30.2568969Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:30.2569991Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.2570776Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:30.2571983Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.2573341Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.2574514Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:30.2575559Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.2576735Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.2578088Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.2579160Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.2580079Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.2580821Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.2581836Z W0507 20:31:30.252000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0235352Z self = 2025-05-07T20:31:32.0236063Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.0236402Z 2025-05-07T20:31:32.0236483Z @given( 2025-05-07T20:31:32.0236755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0237502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0237818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0238156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0238770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0239069Z ) 2025-05-07T20:31:32.0239426Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0239880Z def test_silu_mul_quant( 2025-05-07T20:31:32.0240122Z self, 2025-05-07T20:31:32.0240325Z T: int, 2025-05-07T20:31:32.0240528Z D: int, 2025-05-07T20:31:32.0240747Z scale_ub: Optional[float], 2025-05-07T20:31:32.0241029Z contiguous: bool, 2025-05-07T20:31:32.0241275Z compiled: bool, 2025-05-07T20:31:32.0241500Z ) -> None: 2025-05-07T20:31:32.0241726Z torch.manual_seed(2025) 2025-05-07T20:31:32.0241970Z 2025-05-07T20:31:32.0242283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0242636Z 2025-05-07T20:31:32.0242829Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0243129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0243550Z x = x_sign * x_clamp 2025-05-07T20:31:32.0243788Z x0 = x[:, :D] 2025-05-07T20:31:32.0244009Z x1 = x[:, D:] 2025-05-07T20:31:32.0244246Z 2025-05-07T20:31:32.0244457Z if contiguous: 2025-05-07T20:31:32.0244693Z x0 = x0.contiguous() 2025-05-07T20:31:32.0244953Z x1 = x1.contiguous() 2025-05-07T20:31:32.0245190Z 2025-05-07T20:31:32.0245384Z if scale_ub is not None: 2025-05-07T20:31:32.0245665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0245997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0246312Z ) 2025-05-07T20:31:32.0246515Z else: 2025-05-07T20:31:32.0246730Z scale_ub_tensor = None 2025-05-07T20:31:32.0247149Z 2025-05-07T20:31:32.0247396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0247715Z op = silu_mul_quant 2025-05-07T20:31:32.0247961Z if compiled: 2025-05-07T20:31:32.0248211Z op = torch.compile(op) 2025-05-07T20:31:32.0248517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0248788Z 2025-05-07T20:31:32.0248982Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.0249145Z 2025-05-07T20:31:32.0249251Z moe/activation_test.py:117: 2025-05-07T20:31:32.0249540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0249874Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.0250155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0250851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.0251544Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.0252101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0252794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0253464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0254006Z kernel = self.compile( 2025-05-07T20:31:32.0254558Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0255228Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0255628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0255866Z 2025-05-07T20:31:32.0256076Z self = 2025-05-07T20:31:32.0257185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0258753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e1c60>} 2025-05-07T20:31:32.0260109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0261139Z context = 2025-05-07T20:31:32.0261440Z 2025-05-07T20:31:32.0261608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0262141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0262619Z module_map=module_map) 2025-05-07T20:31:32.0262988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0263350Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.0263616Z E ^ 2025-05-07T20:31:32.0264081Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0264548Z 2025-05-07T20:31:32.0264977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0265553Z 2025-05-07T20:31:32.0265658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0266189Z self=, 2025-05-07T20:31:32.0274887Z T=4096, 2025-05-07T20:31:32.0275109Z D=7168, 2025-05-07T20:31:32.0275305Z scale_ub=None, 2025-05-07T20:31:32.0275542Z contiguous=False, 2025-05-07T20:31:32.0275784Z compiled=False, 2025-05-07T20:31:32.0275996Z ) 2025-05-07T20:31:32.0276494Z self = 2025-05-07T20:31:32.0277010Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.0277291Z 2025-05-07T20:31:32.0277370Z @given( 2025-05-07T20:31:32.0277610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0277932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0278239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0278579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0278919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0279211Z ) 2025-05-07T20:31:32.0279560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0280013Z def test_silu_mul_quant( 2025-05-07T20:31:32.0280267Z self, 2025-05-07T20:31:32.0280459Z T: int, 2025-05-07T20:31:32.0280667Z D: int, 2025-05-07T20:31:32.0280903Z scale_ub: Optional[float], 2025-05-07T20:31:32.0281179Z contiguous: bool, 2025-05-07T20:31:32.0281426Z compiled: bool, 2025-05-07T20:31:32.0281658Z ) -> None: 2025-05-07T20:31:32.0281871Z torch.manual_seed(2025) 2025-05-07T20:31:32.0282122Z 2025-05-07T20:31:32.0282406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0282748Z 2025-05-07T20:31:32.0282947Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0283250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0283677Z x = x_sign * x_clamp 2025-05-07T20:31:32.0283925Z x0 = x[:, :D] 
2025-05-07T20:31:32.0284154Z x1 = x[:, D:] 2025-05-07T20:31:32.0284375Z 2025-05-07T20:31:32.0284560Z if contiguous: 2025-05-07T20:31:32.0284799Z x0 = x0.contiguous() 2025-05-07T20:31:32.0285064Z x1 = x1.contiguous() 2025-05-07T20:31:32.0285302Z 2025-05-07T20:31:32.0285510Z if scale_ub is not None: 2025-05-07T20:31:32.0285894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0286231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0286541Z ) 2025-05-07T20:31:32.0286740Z else: 2025-05-07T20:31:32.0286945Z scale_ub_tensor = None 2025-05-07T20:31:32.0287202Z 2025-05-07T20:31:32.0287438Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0287747Z op = silu_mul_quant 2025-05-07T20:31:32.0288000Z if compiled: 2025-05-07T20:31:32.0288254Z op = torch.compile(op) 2025-05-07T20:31:32.0288546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0288827Z 2025-05-07T20:31:32.0289027Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.0289194Z 2025-05-07T20:31:32.0289304Z moe/activation_test.py:117: 2025-05-07T20:31:32.0289597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0289943Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.0290242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0290935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.0291640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.0292186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0292879Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0293548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0294092Z kernel = self.compile( 2025-05-07T20:31:32.0294667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0295355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0295856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0296097Z 2025-05-07T20:31:32.0296305Z self = 2025-05-07T20:31:32.0297391Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0298773Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09367e2c00>} 2025-05-07T20:31:32.0300112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0301151Z context = 2025-05-07T20:31:32.0301443Z 2025-05-07T20:31:32.0301624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0302154Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0302617Z module_map=module_map) 2025-05-07T20:31:32.0302993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0303352Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.0303608Z E ^ 2025-05-07T20:31:32.0304075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0304540Z 2025-05-07T20:31:32.0305015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0305536Z 2025-05-07T20:31:32.0305639Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0306055Z self=, 2025-05-07T20:31:32.0306540Z T=128, 2025-05-07T20:31:32.0306730Z D=7168, 2025-05-07T20:31:32.0306923Z scale_ub=None, 2025-05-07T20:31:32.0307134Z contiguous=False, 2025-05-07T20:31:32.0307359Z compiled=True, 2025-05-07T20:31:32.0307565Z ) 2025-05-07T20:31:32.0786340Z self = 2025-05-07T20:31:32.0787707Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:32.0788260Z 2025-05-07T20:31:32.0788415Z @given( 2025-05-07T20:31:32.0788873Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.0789488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.0790097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.0790747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.0791396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.0791973Z ) 2025-05-07T20:31:32.0792684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.0793566Z def test_silu_mul_quant( 2025-05-07T20:31:32.0794033Z self, 2025-05-07T20:31:32.0794412Z T: int, 2025-05-07T20:31:32.0794784Z D: int, 2025-05-07T20:31:32.0794997Z scale_ub: Optional[float], 2025-05-07T20:31:32.0795273Z contiguous: bool, 2025-05-07T20:31:32.0795515Z compiled: bool, 2025-05-07T20:31:32.0795738Z ) -> None: 2025-05-07T20:31:32.0795954Z torch.manual_seed(2025) 2025-05-07T20:31:32.0796196Z 2025-05-07T20:31:32.0796466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.0796809Z 2025-05-07T20:31:32.0797005Z x_sign = torch.sign(x) 2025-05-07T20:31:32.0797296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.0797601Z x = x_sign * x_clamp 2025-05-07T20:31:32.0797840Z x0 = x[:, :D] 2025-05-07T20:31:32.0798414Z x1 = x[:, D:] 2025-05-07T20:31:32.0798624Z 2025-05-07T20:31:32.0798815Z if contiguous: 2025-05-07T20:31:32.0799047Z x0 = x0.contiguous() 2025-05-07T20:31:32.0799301Z x1 = x1.contiguous() 2025-05-07T20:31:32.0799542Z 2025-05-07T20:31:32.0799730Z if scale_ub is not None: 2025-05-07T20:31:32.0800000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.0800346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.0800654Z ) 2025-05-07T20:31:32.0800844Z else: 2025-05-07T20:31:32.0801054Z scale_ub_tensor = None 2025-05-07T20:31:32.0801307Z 2025-05-07T20:31:32.0801534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0801849Z op = silu_mul_quant 2025-05-07T20:31:32.0802097Z if compiled: 2025-05-07T20:31:32.0802343Z op = torch.compile(op) 2025-05-07T20:31:32.0802641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.0802925Z 2025-05-07T20:31:32.0803120Z y_fp8, y_scale = fn() 2025-05-07T20:31:32.0803534Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:32.0803829Z 2025-05-07T20:31:32.0804075Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.0804406Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:32.0804700Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:32.0805016Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:32.0805374Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:32.0805687Z 2025-05-07T20:31:32.0805892Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:32.0806084Z 2025-05-07T20:31:32.0806193Z moe/activation_test.py:126: 2025-05-07T20:31:32.0806485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0806826Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:32.0807326Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:32.0808117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:32.0808881Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:32.0809436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.0810131Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.0810822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:32.0811557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:32.0812321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:32.0813075Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:32.0813822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:32.0814469Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:32.0815074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:32.0815596Z fn() 2025-05-07T20:31:32.0816107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:32.0816695Z self.fn.run( 2025-05-07T20:31:32.0817169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.0817703Z kernel = self.compile( 2025-05-07T20:31:32.0818361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.0819036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.0819433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.0819670Z 2025-05-07T20:31:32.0819878Z self = 2025-05-07T20:31:32.0820963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.0822350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f09367e3f60>} 2025-05-07T20:31:32.0823714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.0824759Z context = 2025-05-07T20:31:32.0825093Z 2025-05-07T20:31:32.0825260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.0825795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.0826266Z module_map=module_map) 2025-05-07T20:31:32.0826627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.0826984Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.0827251Z E ^ 2025-05-07T20:31:32.0827716Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.0828179Z 2025-05-07T20:31:32.0828602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.0829134Z 2025-05-07T20:31:32.0829325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.0829741Z self=, 2025-05-07T20:31:32.0830141Z T=128, 2025-05-07T20:31:32.0830333Z D=7168, 2025-05-07T20:31:32.0830528Z scale_ub=None, 2025-05-07T20:31:32.0830737Z contiguous=False, 2025-05-07T20:31:32.0830967Z compiled=False, 2025-05-07T20:31:32.0831180Z ) 2025-05-07T20:31:32.2356679Z self = 2025-05-07T20:31:32.2357322Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:32.2357670Z 2025-05-07T20:31:32.2357753Z @given( 2025-05-07T20:31:32.2357987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.2358297Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.2358606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.2358962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.2359302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.2359591Z ) 2025-05-07T20:31:32.2359945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.2360386Z def test_silu_mul_quant( 2025-05-07T20:31:32.2360638Z self, 2025-05-07T20:31:32.2360835Z T: int, 2025-05-07T20:31:32.2361033Z D: int, 2025-05-07T20:31:32.2361243Z scale_ub: Optional[float], 2025-05-07T20:31:32.2361514Z contiguous: bool, 2025-05-07T20:31:32.2361750Z compiled: bool, 2025-05-07T20:31:32.2361979Z ) -> None: 2025-05-07T20:31:32.2362198Z torch.manual_seed(2025) 2025-05-07T20:31:32.2362442Z 2025-05-07T20:31:32.2362717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.2363062Z 2025-05-07T20:31:32.2363255Z x_sign = torch.sign(x) 2025-05-07T20:31:32.2363700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.2364371Z x = x_sign * x_clamp 2025-05-07T20:31:32.2364623Z x0 = x[:, :D] 2025-05-07T20:31:32.2364850Z x1 = x[:, D:] 2025-05-07T20:31:32.2365062Z 2025-05-07T20:31:32.2365248Z if contiguous: 2025-05-07T20:31:32.2365486Z x0 = x0.contiguous() 2025-05-07T20:31:32.2365745Z x1 = x1.contiguous() 2025-05-07T20:31:32.2365982Z 2025-05-07T20:31:32.2366180Z if scale_ub is not None: 2025-05-07T20:31:32.2366458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.2366793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.2367111Z ) 2025-05-07T20:31:32.2367308Z else: 2025-05-07T20:31:32.2367517Z scale_ub_tensor = None 2025-05-07T20:31:32.2367771Z 2025-05-07T20:31:32.2368007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.2368322Z op = silu_mul_quant 2025-05-07T20:31:32.2368579Z if compiled: 
2025-05-07T20:31:32.2368838Z op = torch.compile(op) 2025-05-07T20:31:32.2369135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2369412Z 2025-05-07T20:31:32.2369609Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.2369781Z 2025-05-07T20:31:32.2369887Z moe/activation_test.py:117: 2025-05-07T20:31:32.2370288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2370640Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.2370925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2371620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.2372323Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.2372866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.2373562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.2374444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.2374989Z kernel = self.compile( 2025-05-07T20:31:32.2375539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.2376209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.2376605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2376842Z 2025-05-07T20:31:32.2377050Z self = 2025-05-07T20:31:32.2378135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.2379526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e923ec0>} 2025-05-07T20:31:32.2380874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.2381909Z context = 2025-05-07T20:31:32.2382203Z 2025-05-07T20:31:32.2382374Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.2382896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.2383368Z module_map=module_map) 2025-05-07T20:31:32.2383736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.2384098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.2384361Z E ^ 2025-05-07T20:31:32.2384924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.2385430Z 2025-05-07T20:31:32.2385862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.2386379Z 2025-05-07T20:31:32.2386483Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.2386900Z self=, 2025-05-07T20:31:32.2387303Z T=4096, 2025-05-07T20:31:32.2387495Z D=5120, 2025-05-07T20:31:32.2387685Z scale_ub=1200.0, 2025-05-07T20:31:32.2387910Z contiguous=True, 2025-05-07T20:31:32.2388134Z compiled=False, 2025-05-07T20:31:32.2388337Z ) 2025-05-07T20:31:32.2388659Z self = 2025-05-07T20:31:32.2389160Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:32.2389441Z 2025-05-07T20:31:32.2389521Z @given( 2025-05-07T20:31:32.2389751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:32.2390063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:32.2390364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:32.2390695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:32.2391022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:32.2391312Z ) 2025-05-07T20:31:32.2391656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:32.2392098Z def test_silu_mul_quant( 2025-05-07T20:31:32.2392341Z self, 2025-05-07T20:31:32.2392529Z T: int, 2025-05-07T20:31:32.2392726Z D: int, 2025-05-07T20:31:32.2392944Z scale_ub: Optional[float], 2025-05-07T20:31:32.2393208Z contiguous: bool, 2025-05-07T20:31:32.2393449Z compiled: bool, 2025-05-07T20:31:32.2393673Z ) -> None: 2025-05-07T20:31:32.2393895Z torch.manual_seed(2025) 2025-05-07T20:31:32.2394226Z 2025-05-07T20:31:32.2394496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:32.2394854Z 2025-05-07T20:31:32.2395084Z x_sign = torch.sign(x) 2025-05-07T20:31:32.2395380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:32.2395691Z x = x_sign * x_clamp 2025-05-07T20:31:32.2395931Z x0 = x[:, :D] 2025-05-07T20:31:32.2396151Z x1 = x[:, D:] 2025-05-07T20:31:32.2396364Z 2025-05-07T20:31:32.2396544Z if contiguous: 2025-05-07T20:31:32.2396780Z x0 = x0.contiguous() 2025-05-07T20:31:32.2397038Z x1 = x1.contiguous() 2025-05-07T20:31:32.2397272Z 2025-05-07T20:31:32.2397467Z if scale_ub is not None: 2025-05-07T20:31:32.2397739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:32.2398069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:32.2398380Z ) 2025-05-07T20:31:32.2398589Z else: 2025-05-07T20:31:32.2398797Z scale_ub_tensor = None 2025-05-07T20:31:32.2399056Z 2025-05-07T20:31:32.2399294Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:32.2399600Z op = silu_mul_quant 2025-05-07T20:31:32.2399852Z if compiled: 2025-05-07T20:31:32.2400097Z op = torch.compile(op) 2025-05-07T20:31:32.2400387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2400664Z 2025-05-07T20:31:32.2400858Z > y_fp8, y_scale = fn() 2025-05-07T20:31:32.2401022Z 2025-05-07T20:31:32.2401124Z moe/activation_test.py:117: 2025-05-07T20:31:32.2401410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2401744Z moe/activation_test.py:115: in fn 2025-05-07T20:31:32.2402025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:32.2402798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:32.2403646Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:32.2404192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:32.2404884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:32.2405550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.2406090Z kernel = self.compile( 2025-05-07T20:31:32.2406639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.2407298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.2407695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.2407931Z 2025-05-07T20:31:32.2408145Z self = 2025-05-07T20:31:32.2409233Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.2410606Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e720720>} 2025-05-07T20:31:32.2411948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.2412980Z context = 2025-05-07T20:31:32.2413275Z 2025-05-07T20:31:32.2413444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.2413977Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.2414527Z module_map=module_map) 2025-05-07T20:31:32.2414892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.2415251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.2415508Z E ^ 2025-05-07T20:31:32.2415977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.2416438Z 2025-05-07T20:31:32.2416861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.2417382Z 2025-05-07T20:31:32.2417495Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.2417904Z self=, 2025-05-07T20:31:32.2418307Z T=1, 2025-05-07T20:31:32.2418499Z D=5120, 2025-05-07T20:31:32.2418685Z scale_ub=None, 2025-05-07T20:31:32.2418906Z contiguous=True, 2025-05-07T20:31:32.2419139Z compiled=True, 2025-05-07T20:31:32.2419338Z ) 2025-05-07T20:31:32.5657986Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.5659050Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:32.5660397Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.5661845Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.5663197Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5664577Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.5665989Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.5666970Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5668210Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.5669602Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.5670669Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5671948Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.5673198Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:32.5674431Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.5675812Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:32.5676646Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.5677675Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.5678693Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:32.5688062Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.5689313Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.5690619Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.5691749Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.5692806Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:32.5694103Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.5695472Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.5696546Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.5697472Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.5698214Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:32.5699244Z W0507 20:31:32.563000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.1258632Z self = 2025-05-07T20:31:33.1259281Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:33.1259557Z 2025-05-07T20:31:33.1259640Z @given( 2025-05-07T20:31:33.1259892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.1260213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.1260544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.1260893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.1261224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.1261503Z ) 2025-05-07T20:31:33.1261857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.1262300Z def test_silu_mul_quant( 2025-05-07T20:31:33.1262535Z self, 2025-05-07T20:31:33.1262733Z T: int, 2025-05-07T20:31:33.1262935Z D: int, 2025-05-07T20:31:33.1263146Z scale_ub: Optional[float], 2025-05-07T20:31:33.1263421Z contiguous: bool, 2025-05-07T20:31:33.1263664Z compiled: bool, 2025-05-07T20:31:33.1263890Z ) -> None: 2025-05-07T20:31:33.1264109Z torch.manual_seed(2025) 2025-05-07T20:31:33.1264359Z 2025-05-07T20:31:33.1264635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.1264992Z 2025-05-07T20:31:33.1265562Z x_sign = torch.sign(x) 2025-05-07T20:31:33.1265865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.1266173Z x = x_sign * x_clamp 2025-05-07T20:31:33.1266414Z x0 = x[:, :D] 2025-05-07T20:31:33.1266637Z x1 = x[:, D:] 2025-05-07T20:31:33.1266849Z 2025-05-07T20:31:33.1267041Z if contiguous: 2025-05-07T20:31:33.1267277Z x0 = x0.contiguous() 2025-05-07T20:31:33.1267530Z x1 = x1.contiguous() 2025-05-07T20:31:33.1267776Z 2025-05-07T20:31:33.1267968Z if scale_ub is not None: 2025-05-07T20:31:33.1268237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.1268576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.1268886Z ) 2025-05-07T20:31:33.1269075Z else: 2025-05-07T20:31:33.1269287Z scale_ub_tensor = None 2025-05-07T20:31:33.1269538Z 2025-05-07T20:31:33.1269766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.1270082Z op = silu_mul_quant 2025-05-07T20:31:33.1270333Z if compiled: 2025-05-07T20:31:33.1270581Z op = torch.compile(op) 2025-05-07T20:31:33.1270875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.1271150Z 2025-05-07T20:31:33.1271344Z y_fp8, y_scale = fn() 2025-05-07T20:31:33.1271625Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:33.1271914Z 2025-05-07T20:31:33.1272151Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.1272481Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:33.1272782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:33.1273100Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:33.1273454Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.1273769Z 2025-05-07T20:31:33.1273971Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:33.1274166Z 2025-05-07T20:31:33.1274281Z moe/activation_test.py:126: 2025-05-07T20:31:33.1274741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.1275084Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:33.1275415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.1276204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:33.1276961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:33.1277510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:33.1278204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.1278892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:33.1279622Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.1280388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:33.1281145Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.1281874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:33.1282521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:33.1283131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:33.1283818Z fn() 2025-05-07T20:31:33.1284332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:33.1284916Z self.fn.run( 2025-05-07T20:31:33.1285502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:33.1286039Z kernel = self.compile( 2025-05-07T20:31:33.1286587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.1287250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.1287643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.1287880Z 2025-05-07T20:31:33.1288090Z self = 2025-05-07T20:31:33.1289169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.1290558Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924366660>} 2025-05-07T20:31:33.1291908Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.1292933Z context = 2025-05-07T20:31:33.1293226Z 2025-05-07T20:31:33.1293394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.1293920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.1294388Z module_map=module_map) 2025-05-07T20:31:33.1294747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.1295118Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:33.1295389Z E ^ 2025-05-07T20:31:33.1295854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.1296450Z 2025-05-07T20:31:33.1296872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.1297395Z 2025-05-07T20:31:33.1297499Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.1297918Z self=, 2025-05-07T20:31:33.1298314Z T=2048, 2025-05-07T20:31:33.1298502Z D=5120, 2025-05-07T20:31:33.1298699Z scale_ub=None, 2025-05-07T20:31:33.1298911Z contiguous=True, 2025-05-07T20:31:33.1299137Z compiled=True, 2025-05-07T20:31:33.1299347Z ) 2025-05-07T20:31:33.4458250Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:33.4459385Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:33.4460766Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:33.4462223Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:33.4463210Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4464530Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:33.4466285Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.4467288Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4468651Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:33.4470058Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.4471141Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4472427Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:33.4473693Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:33.4474924Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:33.4476366Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:33.4477274Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.4478540Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:33.4479898Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:33.4480708Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:33.4481930Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:33.4483229Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:33.4484557Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:33.4485713Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:33.4486907Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:33.4488264Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:33.4489319Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.4490420Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.4491163Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:33.4492179Z W0507 20:31:33.443000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.5329717Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:33.5330818Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:33.5332199Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:33.5333673Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:33.5334673Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5336064Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:33.5337483Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.5339222Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5340460Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:33.5341855Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.5342934Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5344224Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:33.5345484Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:33.5346702Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:33.5347915Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:33.5348744Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:33.5349936Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:33.5350971Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:33.5351759Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:33.5352972Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:33.5354262Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:33.5355387Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:33.5356440Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:33.5357619Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:33.5358975Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:33.5360036Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.5360952Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.5361814Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:33.5362840Z W0507 20:31:33.530000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.1948708Z self = 2025-05-07T20:31:34.1949362Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:34.1949642Z 2025-05-07T20:31:34.1949731Z @given( 2025-05-07T20:31:34.1949967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:34.1950293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:34.1950606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:34.1950937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:34.1951272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:34.1951567Z ) 2025-05-07T20:31:34.1951931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:34.1952374Z def test_silu_mul_quant( 2025-05-07T20:31:34.1952620Z self, 2025-05-07T20:31:34.1952843Z T: int, 2025-05-07T20:31:34.1953050Z D: int, 2025-05-07T20:31:34.1953270Z scale_ub: Optional[float], 2025-05-07T20:31:34.1953544Z contiguous: bool, 2025-05-07T20:31:34.1953782Z compiled: bool, 2025-05-07T20:31:34.1954029Z ) -> None: 2025-05-07T20:31:34.1954252Z torch.manual_seed(2025) 2025-05-07T20:31:34.1954497Z 2025-05-07T20:31:34.1954774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:34.1955115Z 2025-05-07T20:31:34.1955318Z x_sign = torch.sign(x) 2025-05-07T20:31:34.1955615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:34.1955923Z x = x_sign * x_clamp 2025-05-07T20:31:34.1956169Z x0 = x[:, :D] 2025-05-07T20:31:34.1956392Z x1 = x[:, D:] 2025-05-07T20:31:34.1956599Z 2025-05-07T20:31:34.1956794Z if contiguous: 2025-05-07T20:31:34.1957032Z x0 = x0.contiguous() 2025-05-07T20:31:34.1957287Z x1 = x1.contiguous() 2025-05-07T20:31:34.1958002Z 2025-05-07T20:31:34.1958418Z if scale_ub is not None: 2025-05-07T20:31:34.1966964Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:34.1967327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:34.1967646Z ) 2025-05-07T20:31:34.1967843Z else: 2025-05-07T20:31:34.1968064Z scale_ub_tensor = None 2025-05-07T20:31:34.1968327Z 2025-05-07T20:31:34.1968562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:34.1968889Z op = silu_mul_quant 2025-05-07T20:31:34.1969144Z if compiled: 2025-05-07T20:31:34.1969390Z op = torch.compile(op) 2025-05-07T20:31:34.1969693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:34.1969977Z 2025-05-07T20:31:34.1970169Z y_fp8, y_scale = fn() 2025-05-07T20:31:34.1970466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:34.1970765Z 2025-05-07T20:31:34.1971022Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:34.1971369Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:34.1971666Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:34.1971987Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:34.1972345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:34.1972662Z 2025-05-07T20:31:34.1972870Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:34.1973068Z 2025-05-07T20:31:34.1973171Z moe/activation_test.py:126: 2025-05-07T20:31:34.1973476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:34.1973818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:34.1974150Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:34.1974937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:34.1975999Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:34.1976550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:34.1977230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:34.1977928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:34.1978659Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:34.1979419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:34.1980164Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:34.1980896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:34.1981545Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:34.1982162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:34.1982676Z fn() 2025-05-07T20:31:34.1983188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:34.1983772Z self.fn.run( 2025-05-07T20:31:34.1984237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:34.1984769Z kernel = self.compile( 2025-05-07T20:31:34.1985314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:34.1985971Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.1986368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:34.1986602Z 2025-05-07T20:31:34.1986904Z self = 2025-05-07T20:31:34.1987987Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:34.1989366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f092455a5c0>} 2025-05-07T20:31:34.1990692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:34.1991720Z context = 2025-05-07T20:31:34.1992013Z 2025-05-07T20:31:34.1992188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:34.1992723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.1993186Z module_map=module_map) 2025-05-07T20:31:34.1993555Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.1993920Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:34.1994193Z E ^ 2025-05-07T20:31:34.1994658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.1995121Z 2025-05-07T20:31:34.1995540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:34.1996057Z 2025-05-07T20:31:34.1996172Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:34.1996579Z self=, 2025-05-07T20:31:34.1997085Z T=128, 2025-05-07T20:31:34.1997367Z D=5120, 2025-05-07T20:31:34.1997662Z scale_ub=None, 2025-05-07T20:31:34.1997875Z contiguous=True, 2025-05-07T20:31:34.1998102Z compiled=True, 2025-05-07T20:31:34.1998313Z ) 2025-05-07T20:31:34.5262495Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.5263593Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:34.5264943Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.5266467Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.5267455Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5268765Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.5270155Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.5271145Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5272759Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.5274154Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.5275215Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5276493Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.5277749Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:34.5278970Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.5280178Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:34.5281004Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.5282027Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.5283041Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:34.5283970Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.5285352Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.5286636Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.5287749Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.5288784Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:34.5289964Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.5291328Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.5292389Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.5293304Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.5294038Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:34.5295132Z W0507 20:31:34.523000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.1330424Z self = 2025-05-07T20:31:35.1331114Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:35.1331393Z 2025-05-07T20:31:35.1331485Z @given( 2025-05-07T20:31:35.1331721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.1332071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.1332766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.1333098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.1333434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.1333729Z ) 2025-05-07T20:31:35.1334086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.1334535Z def test_silu_mul_quant( 2025-05-07T20:31:35.1334791Z self, 2025-05-07T20:31:35.1334997Z T: int, 2025-05-07T20:31:35.1335189Z D: int, 2025-05-07T20:31:35.1335415Z scale_ub: Optional[float], 2025-05-07T20:31:35.1335695Z contiguous: bool, 2025-05-07T20:31:35.1335939Z compiled: bool, 2025-05-07T20:31:35.1336167Z ) -> None: 2025-05-07T20:31:35.1336388Z torch.manual_seed(2025) 2025-05-07T20:31:35.1336625Z 2025-05-07T20:31:35.1336918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.1337285Z 2025-05-07T20:31:35.1337475Z x_sign = torch.sign(x) 2025-05-07T20:31:35.1337773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.1338086Z x = x_sign * x_clamp 2025-05-07T20:31:35.1338320Z x0 = x[:, :D] 2025-05-07T20:31:35.1338902Z x1 = x[:, D:] 2025-05-07T20:31:35.1339117Z 2025-05-07T20:31:35.1339298Z if contiguous: 2025-05-07T20:31:35.1339533Z x0 = x0.contiguous() 2025-05-07T20:31:35.1339791Z x1 = x1.contiguous() 2025-05-07T20:31:35.1340032Z 2025-05-07T20:31:35.1340218Z if scale_ub is not None: 2025-05-07T20:31:35.1340491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.1340829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.1341134Z ) 2025-05-07T20:31:35.1341329Z else: 2025-05-07T20:31:35.1341539Z scale_ub_tensor = None 2025-05-07T20:31:35.1341781Z 2025-05-07T20:31:35.1342185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.1342515Z op = silu_mul_quant 2025-05-07T20:31:35.1342760Z if compiled: 2025-05-07T20:31:35.1343012Z op = torch.compile(op) 2025-05-07T20:31:35.1343315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.1343585Z 2025-05-07T20:31:35.1343783Z y_fp8, y_scale = fn() 2025-05-07T20:31:35.1344074Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:35.1344355Z 2025-05-07T20:31:35.1344593Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.1344947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:35.1345245Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:35.1345729Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:35.1346096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.1346404Z 2025-05-07T20:31:35.1346613Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:35.1346812Z 2025-05-07T20:31:35.1346923Z moe/activation_test.py:126: 2025-05-07T20:31:35.1347217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.1347561Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:35.1347891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.1348681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:35.1349446Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:35.1349997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.1350682Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.1351368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:35.1352239Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.1353000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:35.1353755Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.1354484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:35.1355128Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:35.1355735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:35.1356264Z fn() 2025-05-07T20:31:35.1356770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:35.1357355Z self.fn.run( 2025-05-07T20:31:35.1357833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.1358368Z kernel = self.compile( 2025-05-07T20:31:35.1358919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.1359583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.1359981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.1360211Z 2025-05-07T20:31:35.1360422Z self = 2025-05-07T20:31:35.1361504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.1363009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924558400>} 2025-05-07T20:31:35.1364533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.1365561Z context = 2025-05-07T20:31:35.1365856Z 2025-05-07T20:31:35.1366025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.1366549Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.1367024Z module_map=module_map) 2025-05-07T20:31:35.1367389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.1367755Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:35.1368030Z E ^ 2025-05-07T20:31:35.1368501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.1368968Z 2025-05-07T20:31:35.1369387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.1369914Z 2025-05-07T20:31:35.1370018Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.1370438Z self=, 2025-05-07T20:31:35.1370838Z T=4096, 2025-05-07T20:31:35.1371039Z D=5120, 2025-05-07T20:31:35.1371233Z scale_ub=None, 2025-05-07T20:31:35.1371447Z contiguous=True, 2025-05-07T20:31:35.1371679Z compiled=True, 2025-05-07T20:31:35.1371891Z ) 2025-05-07T20:31:35.4684509Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.4686364Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:35.4689516Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.4692375Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.4694309Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4696306Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.4697696Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4698675Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4699903Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.4701276Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4702484Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4703775Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.4705027Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:35.4706253Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.4707455Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:35.4708290Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.4709319Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:35.4710344Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:35.4711137Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:35.4712338Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.4713626Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.4714824Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.4715865Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:35.4717036Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.4718387Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.4719446Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4720365Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4721103Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:35.4722112Z W0507 20:31:35.466000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.5564526Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.5566096Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:35.5567837Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.5569300Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.5570268Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5571573Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.5572958Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.5573940Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5575166Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.5576553Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.5577609Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5578888Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.5580289Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:35.5581510Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.5582724Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:35.5583545Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:35.5584576Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:35.5585599Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:35.5586394Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:35.5587600Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.5588889Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.5590085Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.5591138Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:35.5592321Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.5593677Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.5594736Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.5595657Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.5596457Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:35.5597494Z W0507 20:31:35.554000 86685 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090db2e7a0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
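For context on what the reference path above computes: ref_fn applies the SiLU gate x0 * sigmoid(x0) * x1 and then quantizes each row to fp8 via triton_quantize_fp8_row. The eager-mode sketch below illustrates plausible rowwise semantics (per-row absmax scaling against the fp8 maximum); the scaling scheme is an assumption for illustration, not the actual FBGEMM kernel, and quantize_fp8_row_ref is a hypothetical name:

from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # "fp8e4nv" in Triton's naming
FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # Dequantization scale per row; quantize with its inverse.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(FP8_DTYPE)
    return y_fp8, scale

# This matches how the test consumes the outputs:
#   y ~= y_fp8.to(torch.float32) * scale[:, None]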
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:31:36.102000 86685 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
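The recompile-limit warning is independent of the compilation failures: torch.compile guards on input strides, and this test alternates between sliced views of the [T, 2*D] buffer (row stride 2*D) and .contiguous() copies (row stride D), so the guard keeps failing until the default limit of 8 recompiles is exhausted. A small illustration of the stride mismatch quoted above; the mark_dynamic line at the end is one possible mitigation, not something the test does:

import torch

# A [T, 2*D] buffer sliced in half: the view keeps the parent's row stride.
x = torch.randn(128, 2 * 5120)
x0 = x[:, :5120]
print(x0.stride())               # (10240, 1) -> the "actual 10240" in the guard
print(x0.contiguous().stride())  # (5120, 1)  -> the "expected 5120"

# Each new (T, D, contiguous) combination triggers another compile until
# torch._dynamo's recompile_limit (default 8) is hit. A hypothetical way to
# reduce recompiles is to mark the batch dimension dynamic:
# torch._dynamo.mark_dynamic(x0, 0)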
2025-05-07T20:31:36.1734674Z ) 2025-05-07T20:31:36.1734868Z else: 2025-05-07T20:31:36.1735084Z scale_ub_tensor = None 2025-05-07T20:31:36.1735331Z 2025-05-07T20:31:36.1735567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.1735880Z op = silu_mul_quant 2025-05-07T20:31:36.1736122Z if compiled: 2025-05-07T20:31:36.1736370Z op = torch.compile(op) 2025-05-07T20:31:36.1736671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.1736950Z 2025-05-07T20:31:36.1737151Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.1737441Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.1737734Z 2025-05-07T20:31:36.1737967Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.1738301Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.1738884Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.1739194Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.1739556Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.1739872Z 2025-05-07T20:31:36.1740069Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:36.1740271Z 2025-05-07T20:31:36.1740372Z moe/activation_test.py:126: 2025-05-07T20:31:36.1740667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.1741003Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.1741335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.1742294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.1743053Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.1743598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.1744294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.1744998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.1745728Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.1746484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.1747252Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.1747996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.1748646Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.1749246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.1749774Z fn() 2025-05-07T20:31:36.1750290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.1750869Z self.fn.run( 2025-05-07T20:31:36.1751342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.1751885Z kernel = self.compile( 2025-05-07T20:31:36.1752432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.1753211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.1753623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:36.1753853Z 2025-05-07T20:31:36.1754073Z self = 2025-05-07T20:31:36.1755148Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.1756574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d9de200>} 2025-05-07T20:31:36.1757910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.1758951Z context = 2025-05-07T20:31:36.1759240Z 2025-05-07T20:31:36.1759416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.1759936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.1760407Z module_map=module_map) 2025-05-07T20:31:36.1760776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.1761137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.1761398Z E ^ 2025-05-07T20:31:36.1761867Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.1762323Z 2025-05-07T20:31:36.1762753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.1763269Z 2025-05-07T20:31:36.1763456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.1764000Z self=, 2025-05-07T20:31:36.1764404Z T=1, 2025-05-07T20:31:36.1764587Z D=5120, 2025-05-07T20:31:36.1764784Z scale_ub=1200.0, 2025-05-07T20:31:36.1765013Z contiguous=True, 2025-05-07T20:31:36.1765232Z compiled=True, 2025-05-07T20:31:36.1765445Z ) 2025-05-07T20:31:36.4914223Z self = 2025-05-07T20:31:36.4915665Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:36.4916271Z 2025-05-07T20:31:36.4916400Z @given( 2025-05-07T20:31:36.4916666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4916985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4917294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4917629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4917982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4918283Z ) 2025-05-07T20:31:36.4918643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4919090Z def test_silu_mul_quant( 2025-05-07T20:31:36.4919335Z self, 2025-05-07T20:31:36.4919532Z T: int, 2025-05-07T20:31:36.4919736Z D: int, 2025-05-07T20:31:36.4919963Z scale_ub: Optional[float], 2025-05-07T20:31:36.4920235Z contiguous: bool, 2025-05-07T20:31:36.4920479Z compiled: bool, 2025-05-07T20:31:36.4920714Z ) -> None: 2025-05-07T20:31:36.4920926Z torch.manual_seed(2025) 2025-05-07T20:31:36.4921175Z 2025-05-07T20:31:36.4921457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4921797Z 2025-05-07T20:31:36.4921997Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4922296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4922608Z x = x_sign * x_clamp 2025-05-07T20:31:36.4923181Z x0 = x[:, :D] 2025-05-07T20:31:36.4923538Z x1 = x[:, D:] 2025-05-07T20:31:36.4923747Z 2025-05-07T20:31:36.4923942Z if contiguous: 2025-05-07T20:31:36.4924181Z x0 = x0.contiguous() 2025-05-07T20:31:36.4924440Z x1 = x1.contiguous() 2025-05-07T20:31:36.4924685Z 2025-05-07T20:31:36.4924883Z if scale_ub is not None: 2025-05-07T20:31:36.4925158Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:36.4925494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4925808Z ) 2025-05-07T20:31:36.4926004Z else: 2025-05-07T20:31:36.4926215Z scale_ub_tensor = None 2025-05-07T20:31:36.4926470Z 2025-05-07T20:31:36.4926708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4927019Z op = silu_mul_quant 2025-05-07T20:31:36.4927272Z if compiled: 2025-05-07T20:31:36.4927519Z op = torch.compile(op) 2025-05-07T20:31:36.4927831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4928110Z 2025-05-07T20:31:36.4928308Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4928473Z 2025-05-07T20:31:36.4928575Z moe/activation_test.py:117: 2025-05-07T20:31:36.4928877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4929217Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4929509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4930071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:36.4930642Z return fn(*args, **kwargs) 2025-05-07T20:31:36.4931312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4932001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4932558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4933480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4934152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4934687Z kernel = self.compile( 2025-05-07T20:31:36.4935236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4935904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4936310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4936545Z 2025-05-07T20:31:36.4936754Z self = 2025-05-07T20:31:36.4937844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4939520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d451d00>} 2025-05-07T20:31:36.4940873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4941901Z context = 2025-05-07T20:31:36.4942196Z 2025-05-07T20:31:36.4942366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4942896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4943368Z module_map=module_map) 2025-05-07T20:31:36.4943862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4944231Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4944495Z E ^ 2025-05-07T20:31:36.4944962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.4945425Z 2025-05-07T20:31:36.4945847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.4946413Z 2025-05-07T20:31:36.4946532Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.4946953Z self=, 2025-05-07T20:31:36.4947355Z T=1, 2025-05-07T20:31:36.4947546Z D=5120, 2025-05-07T20:31:36.4947748Z scale_ub=None, 2025-05-07T20:31:36.4947961Z contiguous=False, 2025-05-07T20:31:36.4948191Z compiled=True, 2025-05-07T20:31:36.4948403Z ) 2025-05-07T20:31:36.5450303Z self = 2025-05-07T20:31:36.5451077Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.5451463Z 2025-05-07T20:31:36.5451578Z @given( 2025-05-07T20:31:36.5451904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.5452339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.5452740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.5453082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.5453419Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.5453702Z ) 2025-05-07T20:31:36.5454055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.5454505Z def test_silu_mul_quant( 2025-05-07T20:31:36.5454744Z self, 2025-05-07T20:31:36.5454946Z T: int, 2025-05-07T20:31:36.5455168Z D: int, 2025-05-07T20:31:36.5455385Z scale_ub: Optional[float], 2025-05-07T20:31:36.5455670Z contiguous: bool, 2025-05-07T20:31:36.5456107Z compiled: bool, 2025-05-07T20:31:36.5456333Z ) -> None: 2025-05-07T20:31:36.5456558Z torch.manual_seed(2025) 2025-05-07T20:31:36.5456808Z 2025-05-07T20:31:36.5457086Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.5457437Z 2025-05-07T20:31:36.5457643Z x_sign = torch.sign(x) 2025-05-07T20:31:36.5457944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.5458256Z x = x_sign * x_clamp 2025-05-07T20:31:36.5458506Z x0 = x[:, :D] 2025-05-07T20:31:36.5458729Z x1 = x[:, D:] 2025-05-07T20:31:36.5458940Z 2025-05-07T20:31:36.5469223Z if contiguous: 2025-05-07T20:31:36.5469522Z x0 = x0.contiguous() 2025-05-07T20:31:36.5469786Z x1 = x1.contiguous() 2025-05-07T20:31:36.5470032Z 2025-05-07T20:31:36.5470241Z if scale_ub is not None: 2025-05-07T20:31:36.5470515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.5470881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.5471200Z ) 2025-05-07T20:31:36.5471412Z else: 2025-05-07T20:31:36.5471621Z scale_ub_tensor = None 2025-05-07T20:31:36.5471881Z 2025-05-07T20:31:36.5472130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.5472448Z op = silu_mul_quant 2025-05-07T20:31:36.5472714Z if compiled: 2025-05-07T20:31:36.5472969Z op = torch.compile(op) 2025-05-07T20:31:36.5473267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.5473551Z 2025-05-07T20:31:36.5473750Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.5474036Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.5474336Z 2025-05-07T20:31:36.5474583Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.5474919Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.5475380Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.5475711Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.5476076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.5476386Z 2025-05-07T20:31:36.5476596Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:36.5476790Z 2025-05-07T20:31:36.5476903Z moe/activation_test.py:126: 2025-05-07T20:31:36.5477203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.5477547Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.5477884Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.5478677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.5479448Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.5480015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.5480716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.5481411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.5482142Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.5482906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.5483773Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.5484506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.5485156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.5485772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.5486418Z fn() 2025-05-07T20:31:36.5486950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.5487539Z self.fn.run( 2025-05-07T20:31:36.5488018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.5488596Z kernel = self.compile( 2025-05-07T20:31:36.5489149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.5489821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.5490222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.5490468Z 2025-05-07T20:31:36.5490678Z self = 2025-05-07T20:31:36.5491772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.5493151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f090d451260>} 2025-05-07T20:31:36.5494506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.5495540Z context = 2025-05-07T20:31:36.5495823Z 2025-05-07T20:31:36.5495987Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.5496516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.5497069Z module_map=module_map) 2025-05-07T20:31:36.5497440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.5497805Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.5498082Z E ^ 2025-05-07T20:31:36.5498554Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.5499010Z 2025-05-07T20:31:36.5499433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.5499960Z 2025-05-07T20:31:36.5500064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.5500486Z self=, 2025-05-07T20:31:36.5500898Z T=1, 2025-05-07T20:31:36.5501080Z D=5120, 2025-05-07T20:31:36.5501280Z scale_ub=None, 2025-05-07T20:31:36.5501502Z contiguous=True, 2025-05-07T20:31:36.5501727Z compiled=False, 2025-05-07T20:31:36.5501942Z ) 2025-05-07T20:31:36.6674677Z self = 2025-05-07T20:31:36.6675281Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:36.6675661Z 2025-05-07T20:31:36.6675779Z @given( 2025-05-07T20:31:36.6676159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.6676588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.6677003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.6677364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.6677683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.6677963Z ) 2025-05-07T20:31:36.6678307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.6678744Z def test_silu_mul_quant( 2025-05-07T20:31:36.6678982Z self, 2025-05-07T20:31:36.6679179Z T: int, 2025-05-07T20:31:36.6679368Z D: int, 2025-05-07T20:31:36.6679783Z scale_ub: Optional[float], 2025-05-07T20:31:36.6680054Z contiguous: bool, 2025-05-07T20:31:36.6680284Z compiled: bool, 2025-05-07T20:31:36.6680512Z ) -> None: 2025-05-07T20:31:36.6680725Z torch.manual_seed(2025) 2025-05-07T20:31:36.6680963Z 2025-05-07T20:31:36.6681228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.6681564Z 2025-05-07T20:31:36.6681753Z x_sign = torch.sign(x) 2025-05-07T20:31:36.6682036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.6682345Z x = x_sign * x_clamp 2025-05-07T20:31:36.6682583Z x0 = x[:, :D] 2025-05-07T20:31:36.6682794Z x1 = x[:, D:] 2025-05-07T20:31:36.6683004Z 2025-05-07T20:31:36.6683192Z if contiguous: 2025-05-07T20:31:36.6683519Z x0 = x0.contiguous() 2025-05-07T20:31:36.6683783Z x1 = x1.contiguous() 2025-05-07T20:31:36.6684023Z 2025-05-07T20:31:36.6684214Z if scale_ub is not None: 2025-05-07T20:31:36.6684493Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.6684828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.6685130Z ) 2025-05-07T20:31:36.6685337Z else: 2025-05-07T20:31:36.6685561Z scale_ub_tensor = None 2025-05-07T20:31:36.6685839Z 2025-05-07T20:31:36.6686080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.6686481Z op = silu_mul_quant 2025-05-07T20:31:36.6686764Z if compiled: 2025-05-07T20:31:36.6687027Z 
op = torch.compile(op) 2025-05-07T20:31:36.6687359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6687665Z 2025-05-07T20:31:36.6687860Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.6688048Z 2025-05-07T20:31:36.6688151Z moe/activation_test.py:117: 2025-05-07T20:31:36.6688482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6688982Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.6689266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6689958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.6690650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.6691183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.6691868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.6692537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.6693066Z kernel = self.compile( 2025-05-07T20:31:36.6693610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.6694275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.6694676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6694904Z 2025-05-07T20:31:36.6695112Z self = 2025-05-07T20:31:36.6696189Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.6697555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decff60>} 2025-05-07T20:31:36.6698885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.6699907Z context = 2025-05-07T20:31:36.6700274Z 2025-05-07T20:31:36.6700440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.6700962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.6701428Z module_map=module_map) 2025-05-07T20:31:36.6701789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6702142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6702404Z E ^ 2025-05-07T20:31:36.6702872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.6703322Z 2025-05-07T20:31:36.6703741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.6704262Z 2025-05-07T20:31:36.6704364Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.6704783Z self=, 2025-05-07T20:31:36.6705192Z T=128, 2025-05-07T20:31:36.6705374Z D=5120, 2025-05-07T20:31:36.6705567Z scale_ub=None, 2025-05-07T20:31:36.6705788Z contiguous=False, 2025-05-07T20:31:36.6706005Z compiled=True, 2025-05-07T20:31:36.6706213Z ) 2025-05-07T20:31:36.6706535Z self = 2025-05-07T20:31:36.6707019Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.6707294Z 2025-05-07T20:31:36.6707371Z @given( 2025-05-07T20:31:36.6707602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.6707907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.6708214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.6708543Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.6708874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.6709241Z ) 2025-05-07T20:31:36.6709595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.6710036Z def test_silu_mul_quant( 2025-05-07T20:31:36.6710272Z self, 2025-05-07T20:31:36.6710457Z T: int, 2025-05-07T20:31:36.6710647Z D: int, 2025-05-07T20:31:36.6710855Z scale_ub: Optional[float], 2025-05-07T20:31:36.6711125Z contiguous: bool, 2025-05-07T20:31:36.6711362Z compiled: bool, 2025-05-07T20:31:36.6711581Z ) -> None: 2025-05-07T20:31:36.6711802Z torch.manual_seed(2025) 2025-05-07T20:31:36.6712043Z 2025-05-07T20:31:36.6712306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.6712650Z 2025-05-07T20:31:36.6712843Z x_sign = torch.sign(x) 2025-05-07T20:31:36.6713126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.6713435Z x = x_sign * x_clamp 2025-05-07T20:31:36.6713678Z x0 = x[:, :D] 2025-05-07T20:31:36.6713894Z x1 = x[:, D:] 2025-05-07T20:31:36.6714108Z 2025-05-07T20:31:36.6714291Z if contiguous: 2025-05-07T20:31:36.6714523Z x0 = x0.contiguous() 2025-05-07T20:31:36.6714771Z x1 = x1.contiguous() 2025-05-07T20:31:36.6715010Z 2025-05-07T20:31:36.6715202Z if scale_ub is not None: 2025-05-07T20:31:36.6715467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.6715798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.6716108Z ) 2025-05-07T20:31:36.6716311Z else: 2025-05-07T20:31:36.6716558Z scale_ub_tensor = None 2025-05-07T20:31:36.6716809Z 2025-05-07T20:31:36.6717035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.6717348Z op = silu_mul_quant 2025-05-07T20:31:36.6717597Z if compiled: 2025-05-07T20:31:36.6717838Z op = torch.compile(op) 2025-05-07T20:31:36.6718141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6718531Z 2025-05-07T20:31:36.6718719Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.6718890Z 2025-05-07T20:31:36.6718989Z moe/activation_test.py:117: 2025-05-07T20:31:36.6719287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6719619Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.6719895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.6720455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:36.6721019Z return fn(*args, **kwargs) 
2025-05-07T20:31:36.6721683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.6722376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.6722927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.6723745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.6724407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.6724946Z kernel = self.compile( 2025-05-07T20:31:36.6725494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.6726155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.6726545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.6726781Z 2025-05-07T20:31:36.6726988Z self = 2025-05-07T20:31:36.6728149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.6729529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decea20>} 2025-05-07T20:31:36.6730871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.6731898Z context = 2025-05-07T20:31:36.6732191Z 2025-05-07T20:31:36.6732357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.6732881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.6733341Z module_map=module_map) 2025-05-07T20:31:36.6733703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6734064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6734318Z E ^ 2025-05-07T20:31:36.6734792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.6735249Z 2025-05-07T20:31:36.6735667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.6736180Z 2025-05-07T20:31:36.6736291Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.6736696Z self=, 2025-05-07T20:31:36.6737100Z T=128, 2025-05-07T20:31:36.6737288Z D=7168, 2025-05-07T20:31:36.6737475Z scale_ub=1200.0, 2025-05-07T20:31:36.6737697Z contiguous=False, 2025-05-07T20:31:36.6737921Z compiled=False, 2025-05-07T20:31:36.6738117Z ) 2025-05-07T20:31:36.7616236Z self = 2025-05-07T20:31:36.7616816Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:36.7617401Z 2025-05-07T20:31:36.7617532Z @given( 2025-05-07T20:31:36.7617860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.7618318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.7618743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.7619128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.7619465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.7619758Z ) 2025-05-07T20:31:36.7620124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.7620565Z def test_silu_mul_quant( 2025-05-07T20:31:36.7620813Z self, 2025-05-07T20:31:36.7621018Z T: int, 2025-05-07T20:31:36.7621221Z D: int, 2025-05-07T20:31:36.7621447Z scale_ub: Optional[float], 2025-05-07T20:31:36.7621729Z contiguous: bool, 2025-05-07T20:31:36.7621977Z compiled: bool, 2025-05-07T20:31:36.7622218Z ) -> None: 2025-05-07T20:31:36.7622441Z torch.manual_seed(2025) 2025-05-07T20:31:36.7622692Z 2025-05-07T20:31:36.7622970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.7623309Z 2025-05-07T20:31:36.7623515Z x_sign = torch.sign(x) 2025-05-07T20:31:36.7623812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.7624130Z x = x_sign * x_clamp 2025-05-07T20:31:36.7624367Z x0 = x[:, :D] 2025-05-07T20:31:36.7624614Z x1 = x[:, D:] 2025-05-07T20:31:36.7624830Z 2025-05-07T20:31:36.7625023Z if contiguous: 2025-05-07T20:31:36.7625256Z x0 = x0.contiguous() 2025-05-07T20:31:36.7625520Z x1 = x1.contiguous() 2025-05-07T20:31:36.7625767Z 2025-05-07T20:31:36.7625963Z if scale_ub is not None: 2025-05-07T20:31:36.7626233Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.7626738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.7627064Z ) 2025-05-07T20:31:36.7627260Z else: 2025-05-07T20:31:36.7627480Z scale_ub_tensor = None 2025-05-07T20:31:36.7627738Z 2025-05-07T20:31:36.7627971Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.7628290Z op = silu_mul_quant 2025-05-07T20:31:36.7628546Z if compiled: 2025-05-07T20:31:36.7628791Z op = torch.compile(op) 2025-05-07T20:31:36.7629095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.7629372Z 2025-05-07T20:31:36.7629568Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.7629739Z 2025-05-07T20:31:36.7629839Z moe/activation_test.py:117: 2025-05-07T20:31:36.7630140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.7630479Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.7630762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.7631469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.7632173Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f090d4525c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.7646293Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[same make_ir context as above]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.7687266Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[same test body and traceback as above; same CompilationError while compiling _fbgemm_silu_mul_quant]
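Every example so far fails with the same compile-time error: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) has no native support on this runner's GPU. The job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), and Triton's fp8e4nv codegen targets compute capability 8.9 and newer (Ada/Hopper); on sm_86 only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability guard for tests like this is shown below; the helper name supports_fp8e4nv is illustrative, not part of the test file, and it assumes the kernels under test have no non-fp8e4nv fallback.

    # Sketch: skip FP8 e4m3 tests on GPUs older than sm_89.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn; NVIDIA hardware supports
        # it natively from compute capability (8, 9) onward. The A10G on
        # linux.g5.4xlarge reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...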
2025-05-07T20:31:37.1135667Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[same test body; with compiled=True the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80; same CompilationError]

2025-05-07T20:31:37.1167522Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same test body and traceback; same CompilationError]

2025-05-07T20:31:37.2223985Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[same test body through fn(); here the failure surfaces in the reference path instead]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f090dc9eb60>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:37.2956149Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same test body; back to the original failure mode: same CompilationError while compiling _fbgemm_silu_mul_quant]
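The reference path fails the same way because triton_quantize_fp8_row also emits an fp8e4nv cast, this time from inside the autotuner's benchmarking loop. For orientation, below is a rough eager-PyTorch sketch of rowwise FP8 quantization consistent with the test's dequantization step (y ≈ y_fp8.to(torch.float32) * y_scale[:, None]). This is inferred semantics for illustration only, not FBGEMM's implementation; in particular, treating scale_ub as a cap on the per-row maximum is an assumption.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequant scale chosen so the row's max maps to the fp8 max.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:  # assumed: scale_ub caps the row max
            row_max = torch.minimum(row_max, scale_ub)
        row_max = row_max.clamp(min=1e-12)  # avoid divide-by-zero on zero rows
        scale = row_max / FP8_MAX  # dequant: y ≈ q.to(torch.float32) * scale
        q = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return q, scale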
2025-05-07T20:31:37.4182503Z op = torch.compile(op) 2025-05-07T20:31:37.4182799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4183070Z 2025-05-07T20:31:37.4183264Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.4183424Z 2025-05-07T20:31:37.4183528Z moe/activation_test.py:117: 2025-05-07T20:31:37.4183823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4184151Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.4184429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4184984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.4185550Z return fn(*args, **kwargs) 2025-05-07T20:31:37.4186213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.4186902Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.4187436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.4188121Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.4188793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.4189324Z kernel = self.compile( 2025-05-07T20:31:37.4190000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.4190659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.4191051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4191277Z 2025-05-07T20:31:37.4191484Z self = 2025-05-07T20:31:37.4192554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.4193909Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9f880>} 2025-05-07T20:31:37.4195244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.4196267Z context = 2025-05-07T20:31:37.4196552Z 2025-05-07T20:31:37.4196719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.4197248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.4197719Z module_map=module_map) 2025-05-07T20:31:37.4198076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.4198428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.4198685Z E ^ 2025-05-07T20:31:37.4199148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.4199603Z 2025-05-07T20:31:37.4200112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.4200634Z 2025-05-07T20:31:37.4200739Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.4201147Z self=, 2025-05-07T20:31:37.4201544Z T=1, 2025-05-07T20:31:37.4201723Z D=5120, 2025-05-07T20:31:37.4201911Z scale_ub=1200.0, 2025-05-07T20:31:37.4202129Z contiguous=False, 2025-05-07T20:31:37.4202347Z compiled=False, 2025-05-07T20:31:37.4202543Z ) 2025-05-07T20:31:37.4202857Z self = 2025-05-07T20:31:37.4203342Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:37.4203735Z 2025-05-07T20:31:37.4203812Z @given( 2025-05-07T20:31:37.4204042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.4204355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.4204654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.4204975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.4205296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.4205572Z ) 2025-05-07T20:31:37.4205916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.4206403Z def test_silu_mul_quant( 2025-05-07T20:31:37.4206641Z self, 2025-05-07T20:31:37.4206826Z T: int, 2025-05-07T20:31:37.4207019Z D: int, 2025-05-07T20:31:37.4207235Z scale_ub: Optional[float], 2025-05-07T20:31:37.4207501Z contiguous: bool, 2025-05-07T20:31:37.4207730Z compiled: bool, 2025-05-07T20:31:37.4207949Z ) -> None: 2025-05-07T20:31:37.4208153Z torch.manual_seed(2025) 2025-05-07T20:31:37.4208388Z 2025-05-07T20:31:37.4208654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.4208993Z 2025-05-07T20:31:37.4209273Z x_sign = torch.sign(x) 2025-05-07T20:31:37.4209559Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.4209855Z x = x_sign * x_clamp 2025-05-07T20:31:37.4210084Z x0 = x[:, :D] 2025-05-07T20:31:37.4210290Z x1 = x[:, D:] 2025-05-07T20:31:37.4210491Z 2025-05-07T20:31:37.4210674Z if contiguous: 2025-05-07T20:31:37.4210903Z x0 = x0.contiguous() 2025-05-07T20:31:37.4211149Z x1 = x1.contiguous() 2025-05-07T20:31:37.4211381Z 2025-05-07T20:31:37.4211572Z if scale_ub is not None: 2025-05-07T20:31:37.4211834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.4212161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.4212467Z ) 2025-05-07T20:31:37.4212654Z else: 2025-05-07T20:31:37.4212854Z scale_ub_tensor = None 2025-05-07T20:31:37.4213096Z 2025-05-07T20:31:37.4213329Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.4213651Z op = silu_mul_quant 2025-05-07T20:31:37.4213894Z if compiled: 2025-05-07T20:31:37.4214132Z op = torch.compile(op) 2025-05-07T20:31:37.4214423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4214687Z 2025-05-07T20:31:37.4214877Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.4215036Z 2025-05-07T20:31:37.4215136Z moe/activation_test.py:117: 2025-05-07T20:31:37.4215414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4215738Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.4216011Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.4216693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.4217380Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.4217999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.4218687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.4219344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.4219875Z kernel = self.compile( 2025-05-07T20:31:37.4220414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.4221072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.4221460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.4221686Z 2025-05-07T20:31:37.4221891Z self = 2025-05-07T20:31:37.4222965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.4224325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76480>} 2025-05-07T20:31:37.4225658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.4226678Z context = 2025-05-07T20:31:37.4226968Z 2025-05-07T20:31:37.4227132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.4227647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.4228104Z module_map=module_map) 2025-05-07T20:31:37.4228465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.4228961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.4229211Z E ^ 2025-05-07T20:31:37.4229672Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.4230124Z 2025-05-07T20:31:37.4230544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.4231060Z 2025-05-07T20:31:37.4231166Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.4231567Z self=, 2025-05-07T20:31:37.4231968Z T=16384, 2025-05-07T20:31:37.4232152Z D=5120, 2025-05-07T20:31:37.4232336Z scale_ub=1200.0, 2025-05-07T20:31:37.4232555Z contiguous=False, 2025-05-07T20:31:37.4232773Z compiled=True, 2025-05-07T20:31:37.4232967Z ) 2025-05-07T20:31:37.7119091Z self = 2025-05-07T20:31:37.7120415Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:37.7120976Z 2025-05-07T20:31:37.7121138Z @given( 2025-05-07T20:31:37.7121580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7122197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7122795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7123597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7124264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7124811Z ) 2025-05-07T20:31:37.7125502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7126348Z def test_silu_mul_quant( 2025-05-07T20:31:37.7126581Z self, 2025-05-07T20:31:37.7126777Z T: int, 2025-05-07T20:31:37.7126974Z D: int, 2025-05-07T20:31:37.7127184Z scale_ub: Optional[float], 2025-05-07T20:31:37.7127645Z contiguous: bool, 2025-05-07T20:31:37.7127895Z compiled: bool, 2025-05-07T20:31:37.7128116Z ) -> None: 2025-05-07T20:31:37.7128333Z torch.manual_seed(2025) 2025-05-07T20:31:37.7128571Z 2025-05-07T20:31:37.7128836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7129179Z 2025-05-07T20:31:37.7129378Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7129664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7129966Z x = x_sign * x_clamp 2025-05-07T20:31:37.7130202Z x0 = x[:, :D] 2025-05-07T20:31:37.7130417Z x1 = x[:, D:] 2025-05-07T20:31:37.7130615Z 2025-05-07T20:31:37.7130800Z if contiguous: 2025-05-07T20:31:37.7131031Z x0 = x0.contiguous() 2025-05-07T20:31:37.7131283Z x1 = x1.contiguous() 2025-05-07T20:31:37.7131521Z 2025-05-07T20:31:37.7131710Z if scale_ub is not None: 2025-05-07T20:31:37.7131979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7132315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7132627Z ) 2025-05-07T20:31:37.7132811Z else: 2025-05-07T20:31:37.7133021Z scale_ub_tensor = None 2025-05-07T20:31:37.7133270Z 2025-05-07T20:31:37.7133494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7133802Z op = silu_mul_quant 2025-05-07T20:31:37.7134049Z if compiled: 2025-05-07T20:31:37.7134289Z op = torch.compile(op) 2025-05-07T20:31:37.7134584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7134853Z 2025-05-07T20:31:37.7135045Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.7135207Z 2025-05-07T20:31:37.7135306Z moe/activation_test.py:117: 2025-05-07T20:31:37.7135592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7135919Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.7136196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7136913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.7137470Z return fn(*args, **kwargs) 
2025-05-07T20:31:37.7138129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.7138998Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.7139538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7140222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7140883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7141413Z kernel = self.compile( 2025-05-07T20:31:37.7141965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7142628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7143016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7143247Z 2025-05-07T20:31:37.7143453Z self = 2025-05-07T20:31:37.7144523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7145882Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce751c0>} 2025-05-07T20:31:37.7147389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7148425Z context = 2025-05-07T20:31:37.7148718Z 2025-05-07T20:31:37.7148883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7149405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7149868Z module_map=module_map) 2025-05-07T20:31:37.7150234Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7150586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.7150842Z E ^ 2025-05-07T20:31:37.7151305Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7151762Z 2025-05-07T20:31:37.7152184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7152702Z 2025-05-07T20:31:37.7152808Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7153213Z self=, 2025-05-07T20:31:37.7153613Z T=2048, 2025-05-07T20:31:37.7153798Z D=7168, 2025-05-07T20:31:37.7153987Z scale_ub=1200.0, 2025-05-07T20:31:37.7154205Z contiguous=False, 2025-05-07T20:31:37.7154435Z compiled=True, 2025-05-07T20:31:37.7154635Z ) 2025-05-07T20:31:37.7154948Z self = 2025-05-07T20:31:37.7155449Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:37.7155721Z 2025-05-07T20:31:37.7155802Z @given( 2025-05-07T20:31:37.7156024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7156336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7156638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7156962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7157416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7157702Z ) 2025-05-07T20:31:37.7158050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7158484Z def test_silu_mul_quant( 2025-05-07T20:31:37.7158722Z self, 2025-05-07T20:31:37.7158915Z T: int, 2025-05-07T20:31:37.7159104Z D: int, 2025-05-07T20:31:37.7159316Z scale_ub: Optional[float], 2025-05-07T20:31:37.7159588Z contiguous: bool, 2025-05-07T20:31:37.7159820Z compiled: bool, 2025-05-07T20:31:37.7160041Z ) -> None: 2025-05-07T20:31:37.7160258Z torch.manual_seed(2025) 2025-05-07T20:31:37.7160487Z 2025-05-07T20:31:37.7160759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7161099Z 2025-05-07T20:31:37.7161291Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7161587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7161902Z x = x_sign * x_clamp 2025-05-07T20:31:37.7162133Z x0 = x[:, :D] 2025-05-07T20:31:37.7162347Z x1 = x[:, D:] 2025-05-07T20:31:37.7162556Z 2025-05-07T20:31:37.7162741Z if contiguous: 2025-05-07T20:31:37.7162963Z x0 = x0.contiguous() 2025-05-07T20:31:37.7163217Z x1 = x1.contiguous() 2025-05-07T20:31:37.7163535Z 2025-05-07T20:31:37.7163716Z if scale_ub is not None: 2025-05-07T20:31:37.7163988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7164319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7164620Z ) 2025-05-07T20:31:37.7164807Z else: 2025-05-07T20:31:37.7165020Z scale_ub_tensor = None 2025-05-07T20:31:37.7165263Z 2025-05-07T20:31:37.7165495Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7165803Z op = silu_mul_quant 2025-05-07T20:31:37.7166127Z if compiled: 2025-05-07T20:31:37.7166378Z op = torch.compile(op) 2025-05-07T20:31:37.7166671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7166938Z 2025-05-07T20:31:37.7167131Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.7167298Z 2025-05-07T20:31:37.7167395Z moe/activation_test.py:117: 2025-05-07T20:31:37.7167690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7168012Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.7168291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7168848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:37.7169402Z return fn(*args, **kwargs) 
2025-05-07T20:31:37.7170060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.7170753Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.7171300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7171977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7172641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7173174Z kernel = self.compile( 2025-05-07T20:31:37.7173709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7174364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7174757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7174983Z 2025-05-07T20:31:37.7175195Z self = 2025-05-07T20:31:37.7176266Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7177712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76fc0>} 2025-05-07T20:31:37.7179057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7180083Z context = 2025-05-07T20:31:37.7180369Z 2025-05-07T20:31:37.7180542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7181058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7181530Z module_map=module_map) 2025-05-07T20:31:37.7181904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7182253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.7182509Z E ^ 2025-05-07T20:31:37.7182971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7183419Z 2025-05-07T20:31:37.7183841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7184355Z 2025-05-07T20:31:37.8078255Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.8078680Z self=, 2025-05-07T20:31:37.8079083Z T=1, 2025-05-07T20:31:37.8079264Z D=5120, 2025-05-07T20:31:37.8079490Z scale_ub=None, 2025-05-07T20:31:37.8079833Z contiguous=False, 2025-05-07T20:31:37.8080089Z compiled=False, 2025-05-07T20:31:37.8080297Z ) 2025-05-07T20:31:37.8080907Z self = 2025-05-07T20:31:37.8081408Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:37.8081670Z 2025-05-07T20:31:37.8081754Z @given( 2025-05-07T20:31:37.8081972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.8082281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.8082580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.8082903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.8083225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.8083626Z ) 2025-05-07T20:31:37.8083967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.8084398Z def test_silu_mul_quant( 2025-05-07T20:31:37.8084633Z self, 2025-05-07T20:31:37.8084822Z T: int, 2025-05-07T20:31:37.8085012Z D: int, 2025-05-07T20:31:37.8085228Z scale_ub: Optional[float], 2025-05-07T20:31:37.8085496Z contiguous: bool, 2025-05-07T20:31:37.8085724Z compiled: bool, 2025-05-07T20:31:37.8085944Z ) -> None: 2025-05-07T20:31:37.8086157Z torch.manual_seed(2025) 2025-05-07T20:31:37.8086388Z 2025-05-07T20:31:37.8086653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.8086987Z 2025-05-07T20:31:37.8087168Z x_sign = torch.sign(x) 2025-05-07T20:31:37.8087454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.8087758Z x = x_sign * x_clamp 2025-05-07T20:31:37.8087986Z x0 = x[:, :D] 2025-05-07T20:31:37.8088197Z x1 = x[:, D:] 2025-05-07T20:31:37.8088399Z 2025-05-07T20:31:37.8088584Z if contiguous: 2025-05-07T20:31:37.8088806Z x0 = x0.contiguous() 2025-05-07T20:31:37.8089057Z x1 = x1.contiguous() 2025-05-07T20:31:37.8089299Z 2025-05-07T20:31:37.8089487Z if scale_ub is not None: 2025-05-07T20:31:37.8089892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.8090222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.8097446Z ) 2025-05-07T20:31:37.8097680Z else: 2025-05-07T20:31:37.8097900Z scale_ub_tensor = None 2025-05-07T20:31:37.8098152Z 2025-05-07T20:31:37.8098383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.8098705Z op = silu_mul_quant 2025-05-07T20:31:37.8098955Z if compiled: 2025-05-07T20:31:37.8099202Z op = torch.compile(op) 2025-05-07T20:31:37.8099497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8099760Z 2025-05-07T20:31:37.8099956Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.8100117Z 2025-05-07T20:31:37.8100224Z moe/activation_test.py:117: 2025-05-07T20:31:37.8100512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8100853Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.8101136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8101821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:37.8102499Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:37.8103034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.8103709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.8104371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.8104899Z kernel = self.compile( 2025-05-07T20:31:37.8105440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.8106090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.8106593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8106829Z 2025-05-07T20:31:37.8107037Z self = 2025-05-07T20:31:37.8108113Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.8109467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090c5b0860>} 2025-05-07T20:31:37.8110795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.8111812Z context = 2025-05-07T20:31:37.8112103Z 2025-05-07T20:31:37.8112266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.8112783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.8113242Z module_map=module_map) 2025-05-07T20:31:37.8113595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.8113943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.8114203Z E ^ 2025-05-07T20:31:37.8114659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.8115112Z 2025-05-07T20:31:37.8115528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.8116062Z 2025-05-07T20:31:37.8116161Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.8116580Z self=, 2025-05-07T20:31:37.8117101Z T=4096, 2025-05-07T20:31:37.8117280Z D=7168, 2025-05-07T20:31:37.8117465Z scale_ub=1200.0, 2025-05-07T20:31:37.8117685Z contiguous=False, 2025-05-07T20:31:37.8117902Z compiled=False, 2025-05-07T20:31:37.8118102Z ) 2025-05-07T20:31:37.8118422Z self = 2025-05-07T20:31:37.8118909Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:37.8119190Z 2025-05-07T20:31:37.8119262Z @given( 2025-05-07T20:31:37.8119487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.8119791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.8120088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.8120409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.8120734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.8121025Z ) 2025-05-07T20:31:37.8121369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.8121806Z def test_silu_mul_quant( 2025-05-07T20:31:37.8122031Z self, 2025-05-07T20:31:37.8122223Z T: int, 2025-05-07T20:31:37.8122414Z D: int, 2025-05-07T20:31:37.8122618Z scale_ub: Optional[float], 2025-05-07T20:31:37.8122886Z contiguous: bool, 2025-05-07T20:31:37.8123121Z compiled: bool, 2025-05-07T20:31:37.8123331Z ) -> None: 2025-05-07T20:31:37.8123648Z torch.manual_seed(2025) 2025-05-07T20:31:37.8123887Z 2025-05-07T20:31:37.8124157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.8124494Z 2025-05-07T20:31:37.8124685Z x_sign = torch.sign(x) 2025-05-07T20:31:37.8124970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.8125276Z x = x_sign * x_clamp 2025-05-07T20:31:37.8125593Z x0 = x[:, :D] 2025-05-07T20:31:37.8125808Z x1 = x[:, D:] 2025-05-07T20:31:37.8126013Z 2025-05-07T20:31:37.8126209Z if contiguous: 2025-05-07T20:31:37.8126463Z x0 = x0.contiguous() 2025-05-07T20:31:37.8126712Z x1 = x1.contiguous() 2025-05-07T20:31:37.8126948Z 2025-05-07T20:31:37.8127130Z if scale_ub is not None: 2025-05-07T20:31:37.8127392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.8127716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.8128011Z ) 2025-05-07T20:31:37.8128194Z else: 2025-05-07T20:31:37.8128394Z scale_ub_tensor = None 2025-05-07T20:31:37.8128631Z 2025-05-07T20:31:37.8128859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.8129165Z op = silu_mul_quant 2025-05-07T20:31:37.8129404Z if compiled: 2025-05-07T20:31:37.8129645Z op = torch.compile(op) 2025-05-07T20:31:37.8129939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8130208Z 2025-05-07T20:31:37.8130386Z > y_fp8, y_scale = fn() 2025-05-07T20:31:37.8130554Z 2025-05-07T20:31:37.8130649Z moe/activation_test.py:117: 2025-05-07T20:31:37.8130937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.8131259Z moe/activation_test.py:115: in fn 2025-05-07T20:31:37.8131534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.8132216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:37.8132899Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:37.8144243Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.8144590Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:37.8144844Z E       ^
2025-05-07T20:31:37.8145448Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.8146328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:37.8146952Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:37.9553270Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.9554149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
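Note on the failure mode: every example dies inside make_ir(), i.e. while Triton is still compiling `_fbgemm_silu_mul_quant`, because the kernel uses fp8e4nv, Triton's name for the NVIDIA float8_e4m3fn format. Triton only generates fp8e4nv code on GPUs of compute capability (8, 9) or newer (Ada/Hopper); on anything older the backend advertises just 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports, so the GPU driving this job is evidently pre-SM 8.9. Below is a minimal sketch of a capability gate that would turn such failures into skips; the decorator name `requires_fp8e4nv` is hypothetical, while `torch.cuda.get_device_capability` and `unittest.skipIf` are standard APIs.

    # Minimal sketch (not FBGEMM's actual guard): skip FP8 tests on GPUs whose
    # compute capability is below (8, 9), where Triton has no fp8e4nv support.
    import unittest

    import torch


    def _has_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; NVIDIA GPUs support it
        # natively from compute capability (8, 9) (Ada) onward.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    requires_fp8e4nv = unittest.skipIf(
        not _has_fp8e4nv(),
        "Triton fp8e4nv needs SM 8.9+; this GPU offers only fp8e4b15/fp8e5",
    )

Applied as `@requires_fp8e4nv` on `test_silu_mul_quant`, the examples in this run would be reported as skips instead of errors on this hardware.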
2025-05-07T20:31:37.9554774Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:37.9585087Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.9585966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
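For orientation while reading the remaining examples: from the test body shown above, `silu_mul_quant` takes two bf16 halves of an activation, applies a SiLU-gated multiply, and returns the result quantized to FP8 together with a scale, optionally clamped by `scale_ub`. The eager sketch below is only a plausible reading of that contract; the per-row scaling and the name `silu_mul_quant_ref` are assumptions, not FBGEMM's documented semantics.

    # Hedged eager-mode reference for what a fused silu_mul_quant plausibly
    # computes; the scaling granularity here (per row) is an assumption.
    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # scale_ub: 1-element fp32 tensor
        scale = amax.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale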
2025-05-07T20:31:38.0743335Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:38.0773532Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.0774404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.0775029Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:38.0812684Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.0813647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1694062Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:38.1724003Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.1724878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1725497Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:38.1754839Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.1755716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.1756333Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:38.5845291Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5846170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5846793Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:38.5882536Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5883487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5884108Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:38.6599185Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6600141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.6600761Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:38.6630801Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6631671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.6617907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:38.6618600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.6619129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.6619814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.6620485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.6621021Z kernel = self.compile( 2025-05-07T20:31:38.6621562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.6622218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6622612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.6622849Z 2025-05-07T20:31:38.6623063Z self = 2025-05-07T20:31:38.6624128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.6625494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa287c0>} 2025-05-07T20:31:38.6626841Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.6627861Z context = 2025-05-07T20:31:38.6628148Z 2025-05-07T20:31:38.6628391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.6628918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6629389Z module_map=module_map) 2025-05-07T20:31:38.6629745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6630086Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6630341Z E ^ 2025-05-07T20:31:38.6630801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.6631671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
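[Annotation: every failing example above and below hits the same root cause. The _fbgemm_silu_mul_quant Triton kernel requests the fp8e4nv dtype (torch.float8_e4m3fn), which this Triton build can only lower on NVIDIA GPUs of compute capability (8, 9) or newer; on older parts only fp8e4b15 and fp8e5 are available, so compilation aborts before the kernel ever runs. The g5.4xlarge runner's A10G reports capability (8, 6). Below is a minimal sketch of a capability guard that would skip, rather than fail, these cases on such runners; the helper name and TestCase name are illustrative, not taken from the log:]

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton lowers it only on
        # compute capability (8, 9) or newer (Ada/Hopper). An A10G (8, 6)
        # returns False here, matching the CompilationError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; the real class name is hidden by the stripped reprs in this log.
    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...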
2025-05-07T20:31:38.7923152Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.7923799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.7924480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.7925145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.7925674Z kernel = self.compile( 2025-05-07T20:31:38.7926229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.7926934Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7927319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7927553Z 2025-05-07T20:31:38.7927763Z self = 2025-05-07T20:31:38.7928834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.7930197Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa29620>} 2025-05-07T20:31:38.7931529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.7932627Z context = 2025-05-07T20:31:38.7932916Z 2025-05-07T20:31:38.7933081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.7933601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7934064Z module_map=module_map) 2025-05-07T20:31:38.7934427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.7934782Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.7935041Z E ^ 2025-05-07T20:31:38.7935501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7935960Z 2025-05-07T20:31:38.7936384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:38.7936911Z 2025-05-07T20:31:38.7937014Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:38.7937424Z self=, 2025-05-07T20:31:38.7937821Z T=16384, 2025-05-07T20:31:38.7938012Z D=5120, 2025-05-07T20:31:38.7938198Z scale_ub=1200.0, 2025-05-07T20:31:38.7938627Z contiguous=True, 2025-05-07T20:31:38.7938852Z compiled=True, 2025-05-07T20:31:38.7939053Z ) 2025-05-07T20:31:38.7939364Z self = 2025-05-07T20:31:38.7939858Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:38.7940129Z 2025-05-07T20:31:38.7940215Z @given( 2025-05-07T20:31:38.7940437Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:38.7940747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:38.7941569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:38.7941904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:38.7942226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:38.7942509Z ) 2025-05-07T20:31:38.7942859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:38.7943291Z def test_silu_mul_quant( 2025-05-07T20:31:38.7943552Z self, 2025-05-07T20:31:38.7943745Z T: int, 2025-05-07T20:31:38.7943938Z D: int, 2025-05-07T20:31:38.7944151Z scale_ub: Optional[float], 2025-05-07T20:31:38.7944418Z contiguous: bool, 2025-05-07T20:31:38.7944652Z compiled: bool, 2025-05-07T20:31:38.7944869Z ) -> None: 2025-05-07T20:31:38.7945079Z torch.manual_seed(2025) 2025-05-07T20:31:38.7945320Z 2025-05-07T20:31:38.7945582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:38.7945922Z 2025-05-07T20:31:38.7946124Z x_sign = torch.sign(x) 2025-05-07T20:31:38.7946413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:38.7946714Z x = x_sign * x_clamp 2025-05-07T20:31:38.7946949Z x0 = x[:, :D] 2025-05-07T20:31:38.7947159Z x1 = x[:, D:] 2025-05-07T20:31:38.7947361Z 2025-05-07T20:31:38.7947542Z if contiguous: 2025-05-07T20:31:38.7947772Z x0 = x0.contiguous() 2025-05-07T20:31:38.7948022Z x1 = x1.contiguous() 2025-05-07T20:31:38.7948260Z 2025-05-07T20:31:38.7948452Z if scale_ub is not None: 2025-05-07T20:31:38.7948716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:38.7949045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:38.7949348Z ) 2025-05-07T20:31:38.7949534Z else: 2025-05-07T20:31:38.7949741Z scale_ub_tensor = None 2025-05-07T20:31:38.7949992Z 2025-05-07T20:31:38.7950229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.7950675Z op = silu_mul_quant 2025-05-07T20:31:38.7950922Z if compiled: 2025-05-07T20:31:38.7951157Z op = torch.compile(op) 2025-05-07T20:31:38.7951455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.7951727Z 2025-05-07T20:31:38.7951915Z > y_fp8, y_scale = fn() 2025-05-07T20:31:38.7952076Z 2025-05-07T20:31:38.7952174Z moe/activation_test.py:117: 2025-05-07T20:31:38.7952462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7952791Z moe/activation_test.py:115: in fn 2025-05-07T20:31:38.7953068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.7953625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:38.7954184Z return fn(*args, **kwargs) 
2025-05-07T20:31:38.7954849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:38.7955544Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:38.7956081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.7956816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.7957477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.7958009Z kernel = self.compile( 2025-05-07T20:31:38.7958552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.7959216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7959608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.7959837Z 2025-05-07T20:31:38.7960154Z self = 2025-05-07T20:31:38.7961231Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.7962593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2aa20>} 2025-05-07T20:31:38.7964038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.7965066Z context = 2025-05-07T20:31:38.7965360Z 2025-05-07T20:31:38.7965525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.7966059Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7966525Z module_map=module_map) 2025-05-07T20:31:38.7966885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.7967242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.7967499Z E ^ 2025-05-07T20:31:38.7967956Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7968408Z 2025-05-07T20:31:38.7968827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:38.7969340Z 2025-05-07T20:31:39.1530351Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.1530795Z self=, 2025-05-07T20:31:39.1531406Z T=16384, 2025-05-07T20:31:39.1531683Z D=5120, 2025-05-07T20:31:39.1531893Z scale_ub=None, 2025-05-07T20:31:39.1532115Z contiguous=False, 2025-05-07T20:31:39.1532523Z compiled=True, 2025-05-07T20:31:39.1532737Z ) 2025-05-07T20:31:39.1533062Z self = 2025-05-07T20:31:39.1533559Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:39.1533848Z 2025-05-07T20:31:39.1533931Z @given( 2025-05-07T20:31:39.1534167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.1534481Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.1534788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.1535123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.1535444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.1535735Z ) 2025-05-07T20:31:39.1536089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.1536565Z def test_silu_mul_quant( 2025-05-07T20:31:39.1536825Z self, 2025-05-07T20:31:39.1537041Z T: int, 2025-05-07T20:31:39.1537243Z D: int, 2025-05-07T20:31:39.1537457Z scale_ub: Optional[float], 2025-05-07T20:31:39.1537734Z contiguous: bool, 2025-05-07T20:31:39.1537974Z compiled: bool, 2025-05-07T20:31:39.1538194Z ) -> None: 2025-05-07T20:31:39.1538659Z torch.manual_seed(2025) 2025-05-07T20:31:39.1538904Z 2025-05-07T20:31:39.1539171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.1539513Z 2025-05-07T20:31:39.1539706Z x_sign = torch.sign(x) 2025-05-07T20:31:39.1539992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.1540304Z x = x_sign * x_clamp 2025-05-07T20:31:39.1540548Z x0 = x[:, :D] 2025-05-07T20:31:39.1540760Z x1 = x[:, D:] 2025-05-07T20:31:39.1540965Z 2025-05-07T20:31:39.1541155Z if contiguous: 2025-05-07T20:31:39.1541390Z x0 = x0.contiguous() 2025-05-07T20:31:39.1541783Z x1 = x1.contiguous() 2025-05-07T20:31:39.1542036Z 2025-05-07T20:31:39.1542233Z if scale_ub is not None: 2025-05-07T20:31:39.1542502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.1542839Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.1543151Z ) 2025-05-07T20:31:39.1543344Z else: 2025-05-07T20:31:39.1543558Z scale_ub_tensor = None 2025-05-07T20:31:39.1543813Z 2025-05-07T20:31:39.1544041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.1544362Z op = silu_mul_quant 2025-05-07T20:31:39.1544608Z if compiled: 2025-05-07T20:31:39.1544847Z op = torch.compile(op) 2025-05-07T20:31:39.1545145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.1545426Z 2025-05-07T20:31:39.1545612Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.1545787Z 2025-05-07T20:31:39.1545887Z moe/activation_test.py:117: 2025-05-07T20:31:39.1546186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.1546531Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.1546810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.1547379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.1547947Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.1548606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.1549299Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.1549839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.1550520Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.1551192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.1551852Z kernel = self.compile( 2025-05-07T20:31:39.1552416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.1553082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.1553475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.1553704Z 2025-05-07T20:31:39.1553915Z self = 2025-05-07T20:31:39.1554997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.1556364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2bc40>} 2025-05-07T20:31:39.1557713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.1558736Z context = 2025-05-07T20:31:39.1559025Z 2025-05-07T20:31:39.1559198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.1559725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.1560188Z module_map=module_map) 2025-05-07T20:31:39.1560554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1560907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1561155Z E ^ 2025-05-07T20:31:39.1561699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.1562156Z 2025-05-07T20:31:39.1562584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.1563097Z 2025-05-07T20:31:39.1563207Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.1563750Z self=, 2025-05-07T20:31:39.1564147Z T=2048, 2025-05-07T20:31:39.1564338Z D=5120, 2025-05-07T20:31:39.1564522Z scale_ub=None, 2025-05-07T20:31:39.1564741Z contiguous=False, 2025-05-07T20:31:39.1564963Z compiled=True, 2025-05-07T20:31:39.1565163Z ) 2025-05-07T20:31:39.2294062Z self = 2025-05-07T20:31:39.2294585Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:39.2294887Z 2025-05-07T20:31:39.2294999Z @given( 2025-05-07T20:31:39.2295322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.2295653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.2296030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.2296403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.2296724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.2297011Z ) 2025-05-07T20:31:39.2297361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.2297799Z def test_silu_mul_quant( 2025-05-07T20:31:39.2298042Z self, 2025-05-07T20:31:39.2298236Z T: int, 2025-05-07T20:31:39.2298427Z D: int, 2025-05-07T20:31:39.2298644Z scale_ub: Optional[float], 2025-05-07T20:31:39.2298912Z contiguous: bool, 2025-05-07T20:31:39.2299148Z compiled: bool, 2025-05-07T20:31:39.2299374Z ) -> None: 2025-05-07T20:31:39.2299590Z torch.manual_seed(2025) 2025-05-07T20:31:39.2299823Z 2025-05-07T20:31:39.2300102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.2300617Z 2025-05-07T20:31:39.2300805Z x_sign = torch.sign(x) 2025-05-07T20:31:39.2301097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.2301406Z x = x_sign * x_clamp 2025-05-07T20:31:39.2301643Z x0 = x[:, :D] 2025-05-07T20:31:39.2301853Z x1 = x[:, D:] 2025-05-07T20:31:39.2302061Z 2025-05-07T20:31:39.2302245Z if contiguous: 2025-05-07T20:31:39.2302471Z x0 = x0.contiguous() 2025-05-07T20:31:39.2302729Z x1 = x1.contiguous() 2025-05-07T20:31:39.2302967Z 2025-05-07T20:31:39.2303150Z if scale_ub is not None: 2025-05-07T20:31:39.2303431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.2303765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.2304062Z ) 2025-05-07T20:31:39.2304256Z else: 2025-05-07T20:31:39.2304465Z scale_ub_tensor = None 2025-05-07T20:31:39.2304711Z 2025-05-07T20:31:39.2304951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.2305288Z op = silu_mul_quant 2025-05-07T20:31:39.2305537Z if compiled: 2025-05-07T20:31:39.2305779Z op = torch.compile(op) 2025-05-07T20:31:39.2306075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2306346Z 2025-05-07T20:31:39.2306530Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.2306698Z 2025-05-07T20:31:39.2306797Z moe/activation_test.py:117: 2025-05-07T20:31:39.2307093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2307425Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.2307700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2308262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.2308824Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.2309613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.2310317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.2310858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.2311542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.2312203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.2312742Z kernel = self.compile( 2025-05-07T20:31:39.2313290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.2313955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.2314349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2314583Z 2025-05-07T20:31:39.2314797Z self = 2025-05-07T20:31:39.2315881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.2317294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb87c0>} 2025-05-07T20:31:39.2318636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.2319666Z context = 2025-05-07T20:31:39.2319958Z 2025-05-07T20:31:39.2320126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.2320736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.2321204Z module_map=module_map) 2025-05-07T20:31:39.2321568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.2321922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.2322176Z E ^ 2025-05-07T20:31:39.2322644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.2323104Z 2025-05-07T20:31:39.2323647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.2324166Z 2025-05-07T20:31:39.2324274Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.2324685Z self=, 2025-05-07T20:31:39.2325088Z T=2048, 2025-05-07T20:31:39.2325274Z D=5120, 2025-05-07T20:31:39.2325474Z scale_ub=1200.0, 2025-05-07T20:31:39.2325699Z contiguous=False, 2025-05-07T20:31:39.2325927Z compiled=True, 2025-05-07T20:31:39.2326125Z ) 2025-05-07T20:31:39.2326449Z self = 2025-05-07T20:31:39.2326948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.2327222Z 2025-05-07T20:31:39.2327300Z @given( 2025-05-07T20:31:39.2327524Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.2327840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.2328149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.2328474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.2328808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.2329095Z ) 2025-05-07T20:31:39.2329442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.2329996Z def test_silu_mul_quant( 2025-05-07T20:31:39.2330246Z self, 2025-05-07T20:31:39.2330434Z T: int, 2025-05-07T20:31:39.2330631Z D: int, 2025-05-07T20:31:39.2330847Z scale_ub: Optional[float], 2025-05-07T20:31:39.2331115Z contiguous: bool, 2025-05-07T20:31:39.2331352Z compiled: bool, 2025-05-07T20:31:39.2331579Z ) -> None: 2025-05-07T20:31:39.2331786Z torch.manual_seed(2025) 2025-05-07T20:31:39.2332023Z 2025-05-07T20:31:39.2332295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.2332635Z 2025-05-07T20:31:39.2332837Z x_sign = torch.sign(x) 2025-05-07T20:31:39.2333136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.2333449Z x = x_sign * x_clamp 2025-05-07T20:31:39.2333679Z x0 = x[:, :D] 2025-05-07T20:31:39.2333893Z x1 = x[:, D:] 2025-05-07T20:31:39.2334101Z 2025-05-07T20:31:39.2334278Z if contiguous: 2025-05-07T20:31:39.2334526Z x0 = x0.contiguous() 2025-05-07T20:31:39.2334789Z x1 = x1.contiguous() 2025-05-07T20:31:39.2335030Z 2025-05-07T20:31:39.2335221Z if scale_ub is not None: 2025-05-07T20:31:39.2335489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.2335822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.2336130Z ) 2025-05-07T20:31:39.2336324Z else: 2025-05-07T20:31:39.2336529Z scale_ub_tensor = None 2025-05-07T20:31:39.2336777Z 2025-05-07T20:31:39.2337010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.2337315Z op = silu_mul_quant 2025-05-07T20:31:39.2337564Z if compiled: 2025-05-07T20:31:39.2337812Z op = torch.compile(op) 2025-05-07T20:31:39.2338107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2338622Z 2025-05-07T20:31:39.2338821Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.2338984Z 2025-05-07T20:31:39.2339092Z moe/activation_test.py:117: 2025-05-07T20:31:39.2339509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2339840Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.2340119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.2340676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.2341241Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.2341909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.2342603Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.2343143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.2343831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.2344508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.2345044Z kernel = self.compile( 2025-05-07T20:31:39.2345593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.2346257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.2346659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.2346886Z 2025-05-07T20:31:39.2347097Z self = 2025-05-07T20:31:39.2348179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.2349665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb98a0>} 2025-05-07T20:31:39.2351022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.2352053Z context = 2025-05-07T20:31:39.2352340Z 2025-05-07T20:31:39.2352509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.2353028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.2353492Z module_map=module_map) 2025-05-07T20:31:39.2353849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.2354204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.2354469Z E ^ 2025-05-07T20:31:39.2354938Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.2355392Z 2025-05-07T20:31:39.2355809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.2356333Z 2025-05-07T20:31:39.3692723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.3693604Z self=, 2025-05-07T20:31:39.3694398Z T=4096, 2025-05-07T20:31:39.3694768Z D=5120, 2025-05-07T20:31:39.3695137Z scale_ub=1200.0, 2025-05-07T20:31:39.3695575Z contiguous=True, 2025-05-07T20:31:39.3696004Z compiled=True, 2025-05-07T20:31:39.3696395Z ) 2025-05-07T20:31:39.3696994Z self = 2025-05-07T20:31:39.3697539Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:39.3697808Z 2025-05-07T20:31:39.3697883Z @given( 2025-05-07T20:31:39.3698121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.3698581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.3698881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.3699210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.3699539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.3699821Z ) 2025-05-07T20:31:39.3701470Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.3701907Z def test_silu_mul_quant( 2025-05-07T20:31:39.3702145Z self, 2025-05-07T20:31:39.3702328Z T: int, 2025-05-07T20:31:39.3702527Z D: int, 2025-05-07T20:31:39.3702742Z scale_ub: Optional[float], 2025-05-07T20:31:39.3703004Z contiguous: bool, 2025-05-07T20:31:39.3703239Z compiled: bool, 2025-05-07T20:31:39.3703458Z ) -> None: 2025-05-07T20:31:39.3703667Z torch.manual_seed(2025) 2025-05-07T20:31:39.3703907Z 2025-05-07T20:31:39.3704191Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.3704529Z 2025-05-07T20:31:39.3704724Z x_sign = torch.sign(x) 2025-05-07T20:31:39.3705021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.3705325Z x = x_sign * x_clamp 2025-05-07T20:31:39.3705561Z x0 = x[:, :D] 2025-05-07T20:31:39.3705780Z x1 = x[:, D:] 2025-05-07T20:31:39.3705989Z 2025-05-07T20:31:39.3706164Z if contiguous: 2025-05-07T20:31:39.3706394Z x0 = x0.contiguous() 2025-05-07T20:31:39.3706682Z x1 = x1.contiguous() 2025-05-07T20:31:39.3706934Z 2025-05-07T20:31:39.3707121Z if scale_ub is not None: 2025-05-07T20:31:39.3707392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.3707719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.3708021Z ) 2025-05-07T20:31:39.3708213Z else: 2025-05-07T20:31:39.3708538Z scale_ub_tensor = None 2025-05-07T20:31:39.3708795Z 2025-05-07T20:31:39.3709026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.3709335Z op = silu_mul_quant 2025-05-07T20:31:39.3709584Z if compiled: 2025-05-07T20:31:39.3709828Z op = torch.compile(op) 2025-05-07T20:31:39.3710119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.3710393Z 2025-05-07T20:31:39.3710582Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.3712179Z 2025-05-07T20:31:39.3712284Z moe/activation_test.py:117: 2025-05-07T20:31:39.3712571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.3712904Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.3713184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.3713738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.3714302Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.3714966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.3715653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.3716186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.3716867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.3717532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.3718058Z kernel = self.compile( 2025-05-07T20:31:39.3718601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.3719261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.3719661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.3719973Z 2025-05-07T20:31:39.3720183Z self = 2025-05-07T20:31:39.3721259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.3722617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdbaac0>} 2025-05-07T20:31:39.3724071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.3725097Z context = 2025-05-07T20:31:39.3725383Z 2025-05-07T20:31:39.3725554Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.3726078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.3726546Z module_map=module_map) 2025-05-07T20:31:39.3726902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.3727253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.3727508Z E ^ 2025-05-07T20:31:39.3727971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.3728421Z 2025-05-07T20:31:39.3728841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.3729359Z 2025-05-07T20:31:39.3729463Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.3729875Z self=, 2025-05-07T20:31:39.3730355Z T=128, 2025-05-07T20:31:39.3730547Z D=5120, 2025-05-07T20:31:39.3730729Z scale_ub=1200.0, 2025-05-07T20:31:39.3730942Z contiguous=False, 2025-05-07T20:31:39.3731165Z compiled=True, 2025-05-07T20:31:39.3731368Z ) 2025-05-07T20:31:39.4566518Z self = 2025-05-07T20:31:39.4567045Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.4567329Z 2025-05-07T20:31:39.4567409Z @given( 2025-05-07T20:31:39.4567644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.4568056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.4568444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.4568780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.4569113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.4569395Z ) 2025-05-07T20:31:39.4569752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.4570202Z def test_silu_mul_quant( 2025-05-07T20:31:39.4570436Z self, 2025-05-07T20:31:39.4570635Z T: int, 2025-05-07T20:31:39.4570837Z D: int, 2025-05-07T20:31:39.4571052Z scale_ub: Optional[float], 2025-05-07T20:31:39.4571325Z contiguous: bool, 2025-05-07T20:31:39.4571565Z compiled: bool, 2025-05-07T20:31:39.4571786Z ) -> None: 2025-05-07T20:31:39.4572001Z torch.manual_seed(2025) 2025-05-07T20:31:39.4572242Z 2025-05-07T20:31:39.4572520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.4572860Z 2025-05-07T20:31:39.4573056Z x_sign = torch.sign(x) 2025-05-07T20:31:39.4573343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.4580855Z x = x_sign * x_clamp 2025-05-07T20:31:39.4581151Z x0 = x[:, :D] 2025-05-07T20:31:39.4581373Z x1 = x[:, D:] 2025-05-07T20:31:39.4581588Z 2025-05-07T20:31:39.4581954Z if contiguous: 2025-05-07T20:31:39.4582196Z x0 = x0.contiguous() 2025-05-07T20:31:39.4582458Z x1 = x1.contiguous() 2025-05-07T20:31:39.4582696Z 2025-05-07T20:31:39.4582889Z if scale_ub is not None: 2025-05-07T20:31:39.4583160Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.4583496Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.4583838Z ) 2025-05-07T20:31:39.4584032Z else: 2025-05-07T20:31:39.4584243Z scale_ub_tensor = None 2025-05-07T20:31:39.4584499Z 2025-05-07T20:31:39.4584732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.4585049Z op = silu_mul_quant 2025-05-07T20:31:39.4585302Z if compiled: 2025-05-07T20:31:39.4585542Z op = torch.compile(op) 2025-05-07T20:31:39.4585840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4586116Z 2025-05-07T20:31:39.4586311Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.4586493Z 2025-05-07T20:31:39.4586596Z moe/activation_test.py:117: 2025-05-07T20:31:39.4586892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4587227Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.4587508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4588075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.4588643Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.4589298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.4589988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.4590528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.4591341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.4592013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.4592546Z kernel = self.compile( 2025-05-07T20:31:39.4593096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.4593753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4594160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4594398Z 2025-05-07T20:31:39.4594608Z self = 2025-05-07T20:31:39.4595685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.4597118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80c540>} 2025-05-07T20:31:39.4598461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.4599494Z context = 2025-05-07T20:31:39.4599787Z 2025-05-07T20:31:39.4599955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.4600480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4600947Z module_map=module_map) 2025-05-07T20:31:39.4601314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4601670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4601924Z E ^ 2025-05-07T20:31:39.4602487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4602943Z 2025-05-07T20:31:39.4603537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.4604053Z 2025-05-07T20:31:39.4604164Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.4604568Z self=, 2025-05-07T20:31:39.4604969Z T=16384, 2025-05-07T20:31:39.4605162Z D=7168, 2025-05-07T20:31:39.4605351Z scale_ub=1200.0, 2025-05-07T20:31:39.4605577Z contiguous=True, 2025-05-07T20:31:39.4605804Z compiled=True, 2025-05-07T20:31:39.4606002Z ) 2025-05-07T20:31:39.4606316Z self = 2025-05-07T20:31:39.4606819Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:39.4607125Z 2025-05-07T20:31:39.4607222Z @given( 2025-05-07T20:31:39.4607460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.4607772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.4608073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.4608389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.4608713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.4609000Z ) 2025-05-07T20:31:39.4609343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.4609784Z def test_silu_mul_quant( 2025-05-07T20:31:39.4610028Z self, 2025-05-07T20:31:39.4610225Z T: int, 2025-05-07T20:31:39.4610419Z D: int, 2025-05-07T20:31:39.4610636Z scale_ub: Optional[float], 2025-05-07T20:31:39.4610905Z contiguous: bool, 2025-05-07T20:31:39.4611137Z compiled: bool, 2025-05-07T20:31:39.4611360Z ) -> None: 2025-05-07T20:31:39.4611663Z torch.manual_seed(2025) 2025-05-07T20:31:39.4611910Z 2025-05-07T20:31:39.4612178Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.4612515Z 2025-05-07T20:31:39.4612713Z x_sign = torch.sign(x) 2025-05-07T20:31:39.4612995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.4613298Z x = x_sign * x_clamp 2025-05-07T20:31:39.4613534Z x0 = x[:, :D] 2025-05-07T20:31:39.4613741Z x1 = x[:, D:] 2025-05-07T20:31:39.4613945Z 2025-05-07T20:31:39.4614127Z if contiguous: 2025-05-07T20:31:39.4614350Z x0 = x0.contiguous() 2025-05-07T20:31:39.4614612Z x1 = x1.contiguous() 2025-05-07T20:31:39.4614847Z 2025-05-07T20:31:39.4615027Z if scale_ub is not None: 2025-05-07T20:31:39.4615299Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.4615634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.4615936Z ) 2025-05-07T20:31:39.4616142Z else: 2025-05-07T20:31:39.4616350Z scale_ub_tensor = None 2025-05-07T20:31:39.4616600Z 2025-05-07T20:31:39.4616830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.4617142Z op = silu_mul_quant 2025-05-07T20:31:39.4617385Z if compiled: 2025-05-07T20:31:39.4617626Z op = torch.compile(op) 2025-05-07T20:31:39.4617918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4618188Z 2025-05-07T20:31:39.4618371Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.4618538Z 2025-05-07T20:31:39.4618636Z moe/activation_test.py:117: 2025-05-07T20:31:39.4618927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4619247Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.4619531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.4620093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.4620734Z return fn(*args, **kwargs) 
2025-05-07T20:31:39.4621387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.4622075Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.4622616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.4623291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.4623952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.4624487Z kernel = self.compile( 2025-05-07T20:31:39.4625028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.4625680Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4626083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.4626315Z 2025-05-07T20:31:39.4626530Z self = 2025-05-07T20:31:39.4627651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.4629006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80d080>} 2025-05-07T20:31:39.4630348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.4631449Z context = 2025-05-07T20:31:39.4631742Z 2025-05-07T20:31:39.4631914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.4632427Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4632890Z module_map=module_map) 2025-05-07T20:31:39.4633251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4633607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4633860Z E ^ 2025-05-07T20:31:39.4634319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4634770Z 2025-05-07T20:31:39.4635195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.4635709Z 2025-05-07T20:31:39.5594356Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.5594944Z self=, 2025-05-07T20:31:39.5595380Z T=16384, 2025-05-07T20:31:39.5595584Z D=5120, 2025-05-07T20:31:39.5595776Z scale_ub=1200.0, 2025-05-07T20:31:39.5595987Z contiguous=True, 2025-05-07T20:31:39.5596205Z compiled=False, 2025-05-07T20:31:39.5596411Z ) 2025-05-07T20:31:39.5596951Z self = 2025-05-07T20:31:39.5597946Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:39.5598501Z 2025-05-07T20:31:39.5598657Z @given( 2025-05-07T20:31:39.5599091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.5599713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.5600310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.5600944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.5601586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.5602146Z ) 2025-05-07T20:31:39.5603137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.5604177Z def test_silu_mul_quant( 2025-05-07T20:31:39.5604646Z self, 2025-05-07T20:31:39.5605016Z T: int, 2025-05-07T20:31:39.5605394Z D: int, 2025-05-07T20:31:39.5605819Z scale_ub: Optional[float], 2025-05-07T20:31:39.5606348Z contiguous: bool, 2025-05-07T20:31:39.5606789Z compiled: bool, 2025-05-07T20:31:39.5607039Z ) -> None: 2025-05-07T20:31:39.5607283Z torch.manual_seed(2025) 2025-05-07T20:31:39.5607520Z 2025-05-07T20:31:39.5607790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.5608126Z 2025-05-07T20:31:39.5608316Z x_sign = torch.sign(x) 2025-05-07T20:31:39.5608609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.5608918Z x = x_sign * x_clamp 2025-05-07T20:31:39.5609153Z x0 = x[:, :D] 2025-05-07T20:31:39.5609363Z x1 = x[:, D:] 2025-05-07T20:31:39.5609575Z 2025-05-07T20:31:39.5609762Z if contiguous: 2025-05-07T20:31:39.5609989Z x0 = x0.contiguous() 2025-05-07T20:31:39.5610249Z x1 = x1.contiguous() 2025-05-07T20:31:39.5610488Z 2025-05-07T20:31:39.5610679Z if scale_ub is not None: 2025-05-07T20:31:39.5610952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.5611282Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.5611583Z ) 2025-05-07T20:31:39.5611775Z else: 2025-05-07T20:31:39.5611980Z scale_ub_tensor = None 2025-05-07T20:31:39.5612222Z 2025-05-07T20:31:39.5612452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5612762Z op = silu_mul_quant 2025-05-07T20:31:39.5613003Z if compiled: 2025-05-07T20:31:39.5613252Z op = torch.compile(op) 2025-05-07T20:31:39.5613697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5613978Z 2025-05-07T20:31:39.5614173Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.5614335Z 2025-05-07T20:31:39.5614438Z moe/activation_test.py:117: 2025-05-07T20:31:39.5614724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5615053Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.5615355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5616045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:39.5616735Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.5617272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5617959Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5618634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5619167Z kernel = self.compile( 2025-05-07T20:31:39.5619714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5620373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5620766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5620994Z 2025-05-07T20:31:39.5621199Z self = 2025-05-07T20:31:39.5622276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.5623643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80e660>} 2025-05-07T20:31:39.5625067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.5626094Z context = 2025-05-07T20:31:39.5626382Z 2025-05-07T20:31:39.5626550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.5627069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5627536Z module_map=module_map) 2025-05-07T20:31:39.5627891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.5628243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.5628503Z E ^ 2025-05-07T20:31:39.5628963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.5629427Z 2025-05-07T20:31:39.5629846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.5630364Z 2025-05-07T20:31:39.5630468Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.5630882Z self=, 2025-05-07T20:31:39.5631274Z T=1, 2025-05-07T20:31:39.5631455Z D=7168, 2025-05-07T20:31:39.5631645Z scale_ub=1200.0, 2025-05-07T20:31:39.5631860Z contiguous=False, 2025-05-07T20:31:39.5632081Z compiled=False, 2025-05-07T20:31:39.5632283Z ) 2025-05-07T20:31:39.5632598Z self = 2025-05-07T20:31:39.5633077Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:39.5633347Z 2025-05-07T20:31:39.5633424Z @given( 2025-05-07T20:31:39.5633649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.5634039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.5634346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.5634675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.5634996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.5635280Z ) 2025-05-07T20:31:39.5635628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.5636069Z def test_silu_mul_quant( 2025-05-07T20:31:39.5636302Z self, 2025-05-07T20:31:39.5636492Z T: int, 2025-05-07T20:31:39.5636683Z D: int, 2025-05-07T20:31:39.5636890Z scale_ub: Optional[float], 2025-05-07T20:31:39.5637158Z contiguous: bool, 2025-05-07T20:31:39.5637399Z compiled: bool, 2025-05-07T20:31:39.5637613Z ) -> None: 2025-05-07T20:31:39.5637827Z torch.manual_seed(2025) 2025-05-07T20:31:39.5638064Z 2025-05-07T20:31:39.5638341Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.5638918Z 2025-05-07T20:31:39.5639111Z x_sign = torch.sign(x) 2025-05-07T20:31:39.5639394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.5639699Z x = x_sign * x_clamp 2025-05-07T20:31:39.5639937Z x0 = x[:, :D] 2025-05-07T20:31:39.5640144Z x1 = x[:, D:] 2025-05-07T20:31:39.5640349Z 2025-05-07T20:31:39.5640533Z if contiguous: 2025-05-07T20:31:39.5640757Z x0 = x0.contiguous() 2025-05-07T20:31:39.5641011Z x1 = x1.contiguous() 2025-05-07T20:31:39.5641250Z 2025-05-07T20:31:39.5641432Z if scale_ub is not None: 2025-05-07T20:31:39.5641704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.5642036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.5642348Z ) 2025-05-07T20:31:39.5642534Z else: 2025-05-07T20:31:39.5642742Z scale_ub_tensor = None 2025-05-07T20:31:39.5642998Z 2025-05-07T20:31:39.5643451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5643764Z op = silu_mul_quant 2025-05-07T20:31:39.5644018Z if compiled: 2025-05-07T20:31:39.5644257Z op = torch.compile(op) 2025-05-07T20:31:39.5644555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5644833Z 2025-05-07T20:31:39.5645022Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.5645189Z 2025-05-07T20:31:39.5645286Z moe/activation_test.py:117: 2025-05-07T20:31:39.5645575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5645908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.5646182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5646872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:39.5647568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:39.5648110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5648804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5649477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5650018Z kernel = self.compile( 2025-05-07T20:31:39.5650559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5651224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5651622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5651849Z 2025-05-07T20:31:39.5652061Z self = 2025-05-07T20:31:39.5653254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.5654632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80dd00>} 2025-05-07T20:31:39.5655988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.5657022Z context = 2025-05-07T20:31:39.5657310Z 2025-05-07T20:31:39.5657479Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.5658002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5658474Z module_map=module_map) 2025-05-07T20:31:39.5658846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.5659193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.5659449Z E ^ 2025-05-07T20:31:39.5659914Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.5660367Z 2025-05-07T20:31:39.5660786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.5661304Z 2025-05-07T20:31:39.9321478Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.9322716Z self=, 2025-05-07T20:31:39.9323474Z T=4096, 2025-05-07T20:31:39.9323736Z D=7168, 2025-05-07T20:31:39.9323933Z scale_ub=1200.0, 2025-05-07T20:31:39.9324178Z contiguous=False, 2025-05-07T20:31:39.9324414Z compiled=True, 2025-05-07T20:31:39.9324627Z ) 2025-05-07T20:31:39.9324983Z self = 2025-05-07T20:31:39.9325881Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:39.9326163Z 2025-05-07T20:31:39.9326259Z @given( 2025-05-07T20:31:39.9326492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.9326822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.9327140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.9327470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.9327808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.9328108Z ) 2025-05-07T20:31:39.9328462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.9328921Z def test_silu_mul_quant( 2025-05-07T20:31:39.9329174Z self, 2025-05-07T20:31:39.9329373Z T: int, 2025-05-07T20:31:39.9329583Z D: int, 2025-05-07T20:31:39.9329819Z scale_ub: Optional[float], 2025-05-07T20:31:39.9330105Z contiguous: bool, 2025-05-07T20:31:39.9330359Z compiled: bool, 2025-05-07T20:31:39.9330603Z ) -> None: 2025-05-07T20:31:39.9330822Z torch.manual_seed(2025) 2025-05-07T20:31:39.9331074Z 2025-05-07T20:31:39.9331366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.9331722Z 2025-05-07T20:31:39.9331919Z x_sign = torch.sign(x) 2025-05-07T20:31:39.9332226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.9332550Z x = x_sign * x_clamp 2025-05-07T20:31:39.9332795Z x0 = x[:, :D] 2025-05-07T20:31:39.9333021Z x1 = x[:, D:] 2025-05-07T20:31:39.9333242Z 2025-05-07T20:31:39.9333434Z if contiguous: 2025-05-07T20:31:39.9333681Z x0 = x0.contiguous() 2025-05-07T20:31:39.9333955Z x1 = x1.contiguous() 2025-05-07T20:31:39.9334198Z 2025-05-07T20:31:39.9334565Z if scale_ub is not None: 2025-05-07T20:31:39.9334861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.9335200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.9335519Z ) 2025-05-07T20:31:39.9335723Z else: 2025-05-07T20:31:39.9335936Z scale_ub_tensor = None 2025-05-07T20:31:39.9336202Z 2025-05-07T20:31:39.9336454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.9336782Z op = silu_mul_quant 2025-05-07T20:31:39.9337035Z if compiled: 2025-05-07T20:31:39.9337293Z op = torch.compile(op) 2025-05-07T20:31:39.9337604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.9337881Z 2025-05-07T20:31:39.9338081Z > y_fp8, y_scale = fn() 2025-05-07T20:31:39.9338254Z 2025-05-07T20:31:39.9338726Z moe/activation_test.py:117: 2025-05-07T20:31:39.9339069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.9339426Z moe/activation_test.py:115: in fn 2025-05-07T20:31:39.9339722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.9340289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:39.9340864Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
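Each allocation size reported by these OOMs matches exactly one [T, 2 * D] bfloat16 intermediate from the test body (x, x_sign, and x_clamp all have that shape), while roughly 21.6 GiB is already held before the example starts, so the pressure comes from memory accumulated across earlier examples rather than from the example itself. A quick check of the arithmetic, with a hypothetical helper that is not part of the test suite:

    # One [T, 2*D] bfloat16 tensor, in MiB (2 bytes per element).
    def intermediate_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    assert intermediate_mib(16384, 5120) == 320.0  # the 320.00 MiB failure above
    assert intermediate_mib(4096, 7168) == 112.0   # the 112.00 MiB failures below
    assert intermediate_mib(16384, 7168) == 448.0  # the 448.00 MiB failure below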
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError
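The allocator message itself names the mitigation. A sketch of the two standard knobs, assuming they can be applied in the test harness; the env var must be set before the first CUDA allocation:

    import gc
    import os

    # Per the hint in the error text; set before torch touches the GPU.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Between hypothesis examples: drop dead references, then return
        # cached blocks so a ~100 MiB request is not starved by a
        # fragmented 22 GiB cache.
        gc.collect()
        torch.cuda.empty_cache()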
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError
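For triage outside of hypothesis, the failing call can be reproduced directly. This sketch relies only on the import path and call signature visible in the traceback; whether silu_mul_quant is importable this way is an assumption:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Mirrors the T=128, D=5120, scale_ub=None example above; on an SM < 8.9
    # GPU this should raise the same fp8e4nv CompilationError.
    x0 = torch.randn([128, 5120], device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn([128, 5120], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)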
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.3914144Z 2025-05-07T20:31:40.3914259Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.3914469Z 2025-05-07T20:31:40.3914574Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.3914975Z self=, 2025-05-07T20:31:40.3915364Z T=4096, 2025-05-07T20:31:40.3915542Z D=7168, 2025-05-07T20:31:40.3915727Z scale_ub=1200.0, 2025-05-07T20:31:40.3915949Z contiguous=True, 2025-05-07T20:31:40.3916250Z compiled=False, 2025-05-07T20:31:40.3916447Z ) 2025-05-07T20:31:40.4824645Z self = 2025-05-07T20:31:40.4825255Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.4825659Z 2025-05-07T20:31:40.4825767Z @given( 2025-05-07T20:31:40.4826077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4826501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4826913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4827354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4827778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4828073Z ) 2025-05-07T20:31:40.4828419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4828856Z def test_silu_mul_quant( 2025-05-07T20:31:40.4829089Z self, 2025-05-07T20:31:40.4829292Z T: int, 2025-05-07T20:31:40.4829484Z D: int, 2025-05-07T20:31:40.4829691Z scale_ub: Optional[float], 2025-05-07T20:31:40.4829960Z contiguous: bool, 2025-05-07T20:31:40.4830197Z compiled: bool, 2025-05-07T20:31:40.4830413Z ) -> None: 2025-05-07T20:31:40.4830624Z torch.manual_seed(2025) 2025-05-07T20:31:40.4830862Z 2025-05-07T20:31:40.4831132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4833337Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4835211Z 2025-05-07T20:31:40.4835327Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4835537Z 2025-05-07T20:31:40.4835646Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4836052Z self=, 2025-05-07T20:31:40.4836457Z T=16384, 2025-05-07T20:31:40.4836649Z D=7168, 2025-05-07T20:31:40.4836840Z scale_ub=None, 2025-05-07T20:31:40.4837048Z contiguous=False, 2025-05-07T20:31:40.4837270Z compiled=True, 2025-05-07T20:31:40.4837476Z ) 2025-05-07T20:31:40.4837784Z self = 2025-05-07T20:31:40.4838283Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:40.4838763Z 2025-05-07T20:31:40.4838848Z @given( 2025-05-07T20:31:40.4839071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4839394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4839697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4840014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4840339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4840618Z ) 2025-05-07T20:31:40.4840966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4841395Z def test_silu_mul_quant( 2025-05-07T20:31:40.4841638Z self, 2025-05-07T20:31:40.4841827Z T: int, 2025-05-07T20:31:40.4842019Z D: int, 2025-05-07T20:31:40.4842230Z scale_ub: Optional[float], 2025-05-07T20:31:40.4842498Z contiguous: bool, 2025-05-07T20:31:40.4842728Z compiled: bool, 2025-05-07T20:31:40.4842945Z ) -> None: 2025-05-07T20:31:40.4843159Z torch.manual_seed(2025) 2025-05-07T20:31:40.4843496Z 2025-05-07T20:31:40.4843776Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4845979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4847830Z 2025-05-07T20:31:40.4847947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4848156Z 2025-05-07T20:31:40.4848259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4848661Z self=, 2025-05-07T20:31:40.4849062Z T=4096, 2025-05-07T20:31:40.4849247Z D=7168, 2025-05-07T20:31:40.4849440Z scale_ub=None, 2025-05-07T20:31:40.4849651Z contiguous=True, 2025-05-07T20:31:40.4849879Z compiled=False, 2025-05-07T20:31:40.4850077Z ) 2025-05-07T20:31:40.4850393Z self = 2025-05-07T20:31:40.4850879Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:40.4851147Z 2025-05-07T20:31:40.4851229Z @given( 2025-05-07T20:31:40.4851449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4851751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4852052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4852374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4852707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4852987Z ) 2025-05-07T20:31:40.4853325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4853877Z def test_silu_mul_quant( 2025-05-07T20:31:40.4854125Z self, 2025-05-07T20:31:40.4854310Z T: int, 2025-05-07T20:31:40.4854502Z D: int, 2025-05-07T20:31:40.4854720Z scale_ub: Optional[float], 2025-05-07T20:31:40.4854981Z contiguous: bool, 2025-05-07T20:31:40.4855222Z compiled: bool, 2025-05-07T20:31:40.4855468Z ) -> None: 2025-05-07T20:31:40.4855682Z torch.manual_seed(2025) 2025-05-07T20:31:40.4855919Z 2025-05-07T20:31:40.4856187Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4858288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4860148Z 2025-05-07T20:31:40.4860263Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4860477Z 2025-05-07T20:31:40.4860582Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4860982Z self=, 2025-05-07T20:31:40.4861379Z T=16384, 2025-05-07T20:31:40.4861566Z D=7168, 2025-05-07T20:31:40.4861751Z scale_ub=None, 2025-05-07T20:31:40.4861959Z contiguous=True, 2025-05-07T20:31:40.4862174Z compiled=False, 2025-05-07T20:31:40.4862372Z ) 2025-05-07T20:31:40.4862682Z self = 2025-05-07T20:31:40.4863176Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:40.4863449Z 2025-05-07T20:31:40.4863530Z @given( 2025-05-07T20:31:40.4863753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4864151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4864449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4864768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4865095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4865374Z ) 2025-05-07T20:31:40.4865718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4866154Z def test_silu_mul_quant( 2025-05-07T20:31:40.4866394Z self, 2025-05-07T20:31:40.4866587Z T: int, 2025-05-07T20:31:40.4866772Z D: int, 2025-05-07T20:31:40.4867012Z scale_ub: Optional[float], 2025-05-07T20:31:40.4867301Z contiguous: bool, 2025-05-07T20:31:40.4867529Z compiled: bool, 2025-05-07T20:31:40.4867745Z ) -> None: 2025-05-07T20:31:40.4867956Z torch.manual_seed(2025) 2025-05-07T20:31:40.4868189Z 2025-05-07T20:31:40.4868467Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4870497Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4872344Z 2025-05-07T20:31:40.4872464Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4872672Z 2025-05-07T20:31:40.4872777Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4873181Z self=, 2025-05-07T20:31:40.4873666Z T=16384, 2025-05-07T20:31:40.4873858Z D=7168, 2025-05-07T20:31:40.4874041Z scale_ub=1200.0, 2025-05-07T20:31:40.4874262Z contiguous=True, 2025-05-07T20:31:40.4874483Z compiled=False, 2025-05-07T20:31:40.4874681Z ) 2025-05-07T20:31:40.4874998Z self = 2025-05-07T20:31:40.4875496Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.4875775Z 2025-05-07T20:31:40.4875852Z @given( 2025-05-07T20:31:40.4876079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.4876385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.4876689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.4877038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.4877388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.4877668Z ) 2025-05-07T20:31:40.4878014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.4878453Z def test_silu_mul_quant( 2025-05-07T20:31:40.4878685Z self, 2025-05-07T20:31:40.4878872Z T: int, 2025-05-07T20:31:40.4879065Z D: int, 2025-05-07T20:31:40.4879275Z scale_ub: Optional[float], 2025-05-07T20:31:40.4879537Z contiguous: bool, 2025-05-07T20:31:40.4879771Z compiled: bool, 2025-05-07T20:31:40.4879994Z ) -> None: 2025-05-07T20:31:40.4880202Z torch.manual_seed(2025) 2025-05-07T20:31:40.4880440Z 2025-05-07T20:31:40.4880709Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.4882745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
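The repeated OutOfMemoryError above follows the allocator's own hint: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the process makes its first CUDA allocation. A minimal sketch, assuming the environment can be set before torch initializes CUDA and that a per-example cleanup hook is acceptable; the class name and setUp hook are illustrative and not part of activation_test.py:

import os

# Must be set before the first CUDA allocation in the process; the value
# is taken verbatim from the allocator's error message above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import unittest

import torch


class ActivationTestsWithCleanup(unittest.TestCase):  # hypothetical name
    def setUp(self) -> None:
        # Return cached-but-unallocated blocks to the driver so the next
        # example's 20-448 MiB torch.randn request has headroom on the
        # ~22 GiB device described in the log.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()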
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.4884772Z 2025-05-07T20:31:40.4884894Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.4885103Z 2025-05-07T20:31:40.4885203Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.4885610Z self=, 2025-05-07T20:31:40.4886006Z T=128, 2025-05-07T20:31:40.4886184Z D=5120, 2025-05-07T20:31:40.4886371Z scale_ub=1200.0, 2025-05-07T20:31:40.4886591Z contiguous=False, 2025-05-07T20:31:40.4886807Z compiled=False, 2025-05-07T20:31:40.4887000Z ) 2025-05-07T20:31:40.5922556Z self = 2025-05-07T20:31:40.5923366Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:40.5923789Z 2025-05-07T20:31:40.5923905Z @given( 2025-05-07T20:31:40.5924225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.5924662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.5925076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.5925526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.5925932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.5926217Z ) 2025-05-07T20:31:40.5926559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.5927023Z def test_silu_mul_quant( 2025-05-07T20:31:40.5927291Z self, 2025-05-07T20:31:40.5927478Z T: int, 2025-05-07T20:31:40.5927670Z D: int, 2025-05-07T20:31:40.5927881Z scale_ub: Optional[float], 2025-05-07T20:31:40.5928145Z contiguous: bool, 2025-05-07T20:31:40.5928380Z compiled: bool, 2025-05-07T20:31:40.5928597Z ) -> None: 2025-05-07T20:31:40.5929039Z torch.manual_seed(2025) 2025-05-07T20:31:40.5929281Z 2025-05-07T20:31:40.5929548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.5929880Z 2025-05-07T20:31:40.5930066Z x_sign = torch.sign(x) 2025-05-07T20:31:40.5930371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.5930673Z x = x_sign * x_clamp 2025-05-07T20:31:40.5930911Z x0 = x[:, :D] 2025-05-07T20:31:40.5931118Z x1 = x[:, D:] 2025-05-07T20:31:40.5931318Z 2025-05-07T20:31:40.5931503Z if contiguous: 2025-05-07T20:31:40.5931727Z x0 = x0.contiguous() 2025-05-07T20:31:40.5931984Z x1 = x1.contiguous() 2025-05-07T20:31:40.5932220Z 2025-05-07T20:31:40.5932402Z if scale_ub is not None: 2025-05-07T20:31:40.5932671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.5933002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.5933309Z ) 2025-05-07T20:31:40.5933505Z else: 2025-05-07T20:31:40.5933708Z scale_ub_tensor = None 2025-05-07T20:31:40.5933949Z 2025-05-07T20:31:40.5934177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.5934486Z op = silu_mul_quant 2025-05-07T20:31:40.5934740Z if compiled: 2025-05-07T20:31:40.5934984Z op = torch.compile(op) 2025-05-07T20:31:40.5935275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.5935545Z 2025-05-07T20:31:40.5935727Z > y_fp8, y_scale = fn() 2025-05-07T20:31:40.5935893Z 2025-05-07T20:31:40.5935990Z moe/activation_test.py:117: 2025-05-07T20:31:40.5936279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.5936607Z moe/activation_test.py:115: in fn 2025-05-07T20:31:40.5936895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.5937597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:40.5938651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:40.5939237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.5939918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.5940588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.5941119Z kernel = self.compile( 2025-05-07T20:31:40.5941660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.5942316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.5942705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.5942932Z 2025-05-07T20:31:40.5943144Z self = 2025-05-07T20:31:40.5944228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.5945587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150220>} 2025-05-07T20:31:40.5946934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.5947964Z context = 2025-05-07T20:31:40.5948253Z 2025-05-07T20:31:40.5948422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.5949256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.5957414Z module_map=module_map) 2025-05-07T20:31:40.5957788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.5958139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.5958393Z E ^ 2025-05-07T20:31:40.5958860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.5959311Z 2025-05-07T20:31:40.5959734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.5960253Z 2025-05-07T20:31:40.5960354Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.5960765Z self=, 2025-05-07T20:31:40.5961167Z T=2048, 2025-05-07T20:31:40.5961347Z D=7168, 2025-05-07T20:31:40.5961538Z scale_ub=None, 2025-05-07T20:31:40.5961757Z contiguous=False, 2025-05-07T20:31:40.5961970Z compiled=False, 2025-05-07T20:31:40.5962172Z ) 2025-05-07T20:31:40.5962484Z self = 2025-05-07T20:31:40.5962967Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:40.5963349Z 2025-05-07T20:31:40.5963426Z @given( 2025-05-07T20:31:40.5963656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.5963966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.5964265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.5964591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.5964917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.5965192Z ) 2025-05-07T20:31:40.5965539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.5965977Z def test_silu_mul_quant( 2025-05-07T20:31:40.5966380Z self, 2025-05-07T20:31:40.5966565Z T: int, 2025-05-07T20:31:40.5966755Z D: int, 2025-05-07T20:31:40.5966962Z scale_ub: Optional[float], 2025-05-07T20:31:40.5967229Z contiguous: bool, 2025-05-07T20:31:40.5967469Z compiled: bool, 2025-05-07T20:31:40.5967693Z ) -> None: 2025-05-07T20:31:40.5967893Z torch.manual_seed(2025) 2025-05-07T20:31:40.5968127Z 2025-05-07T20:31:40.5968392Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.5970433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
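The CompilationError above is an architecture limit rather than a bad test input: Triton accepts fp8e4nv (e4m3) only on SM 8.9 and newer parts, and the supported list it reports here, ('fp8e4b15', 'fp8e5'), is what it offers on earlier GPUs, so this runner's device evidently predates SM 8.9. A hedged sketch of a capability guard; the helper name, container class, and skip message are illustrative, not the test file's actual gating:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv maps to e4m3; recent Triton releases accept it on SM 8.9+
    # GPUs, which is why this log's GPU reports only ('fp8e4b15', 'fp8e5').
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv needs SM 8.9+")
class Fp8KernelTests(unittest.TestCase):  # hypothetical container
    pass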
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.5972289Z 2025-05-07T20:31:40.5972405Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:40.5972618Z 2025-05-07T20:31:40.5972719Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.5973126Z self=, 2025-05-07T20:31:40.5973522Z T=128, 2025-05-07T20:31:40.5973699Z D=7168, 2025-05-07T20:31:40.5973884Z scale_ub=1200.0, 2025-05-07T20:31:40.5974103Z contiguous=True, 2025-05-07T20:31:40.5974313Z compiled=True, 2025-05-07T20:31:40.5974510Z ) 2025-05-07T20:31:40.6277977Z self = 2025-05-07T20:31:40.6279451Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:40.6280190Z 2025-05-07T20:31:40.6280402Z @given( 2025-05-07T20:31:40.6281372Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6282017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6282606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6283404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6284035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6284591Z ) 2025-05-07T20:31:40.6285272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6286126Z def test_silu_mul_quant( 2025-05-07T20:31:40.6286591Z self, 2025-05-07T20:31:40.6286884Z T: int, 2025-05-07T20:31:40.6287073Z D: int, 2025-05-07T20:31:40.6287288Z scale_ub: Optional[float], 2025-05-07T20:31:40.6287551Z contiguous: bool, 2025-05-07T20:31:40.6287783Z compiled: bool, 2025-05-07T20:31:40.6288010Z ) -> None: 2025-05-07T20:31:40.6288225Z torch.manual_seed(2025) 2025-05-07T20:31:40.6288455Z 2025-05-07T20:31:40.6288730Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6289065Z 2025-05-07T20:31:40.6289253Z x_sign = torch.sign(x) 2025-05-07T20:31:40.6289539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.6289838Z x = x_sign * x_clamp 2025-05-07T20:31:40.6290072Z x0 = x[:, :D] 2025-05-07T20:31:40.6290285Z x1 = x[:, D:] 2025-05-07T20:31:40.6290483Z 2025-05-07T20:31:40.6290666Z if contiguous: 2025-05-07T20:31:40.6290893Z x0 = x0.contiguous() 2025-05-07T20:31:40.6291136Z x1 = x1.contiguous() 2025-05-07T20:31:40.6291374Z 2025-05-07T20:31:40.6291558Z if scale_ub is not None: 2025-05-07T20:31:40.6291824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.6292151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.6292460Z ) 2025-05-07T20:31:40.6292647Z else: 2025-05-07T20:31:40.6292847Z scale_ub_tensor = None 2025-05-07T20:31:40.6293226Z 2025-05-07T20:31:40.6293456Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.6293762Z op = silu_mul_quant 2025-05-07T20:31:40.6294006Z if compiled: 2025-05-07T20:31:40.6294253Z op = torch.compile(op) 2025-05-07T20:31:40.6294541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.6294808Z 2025-05-07T20:31:40.6294994Z > y_fp8, y_scale = fn() 2025-05-07T20:31:40.6295156Z 2025-05-07T20:31:40.6295255Z moe/activation_test.py:117: 2025-05-07T20:31:40.6295544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.6295868Z moe/activation_test.py:115: in fn 2025-05-07T20:31:40.6296139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.6296692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:40.6297247Z return fn(*args, **kwargs) 
2025-05-07T20:31:40.6297909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:40.6298592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:40.6299131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.6299806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.6300463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.6300987Z kernel = self.compile( 2025-05-07T20:31:40.6301527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.6302186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.6302573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.6302884Z 2025-05-07T20:31:40.6303098Z self = 2025-05-07T20:31:40.6304176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.6305540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150860>} 2025-05-07T20:31:40.6306873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.6307888Z context = 2025-05-07T20:31:40.6308178Z 2025-05-07T20:31:40.6308348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.6308868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.6309326Z module_map=module_map) 2025-05-07T20:31:40.6309680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.6310028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.6310286Z E ^ 2025-05-07T20:31:40.6310746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.6311195Z 2025-05-07T20:31:40.6311613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.6312130Z 2025-05-07T20:31:40.6312233Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6312639Z self=, 2025-05-07T20:31:40.6313032Z T=128, 2025-05-07T20:31:40.6313216Z D=7168, 2025-05-07T20:31:40.6313486Z scale_ub=1200.0, 2025-05-07T20:31:40.6313700Z contiguous=True, 2025-05-07T20:31:40.6313917Z compiled=False, 2025-05-07T20:31:40.6314115Z ) 2025-05-07T20:31:40.6314426Z self = 2025-05-07T20:31:40.6314912Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:40.6315184Z 2025-05-07T20:31:40.6315257Z @given( 2025-05-07T20:31:40.6315481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6315781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6316080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6316403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6316722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6317002Z ) 2025-05-07T20:31:40.6317343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6317782Z def test_silu_mul_quant( 2025-05-07T20:31:40.6318015Z self, 2025-05-07T20:31:40.6318205Z T: int, 2025-05-07T20:31:40.6318392Z D: int, 2025-05-07T20:31:40.6318603Z scale_ub: Optional[float], 2025-05-07T20:31:40.6318862Z contiguous: bool, 2025-05-07T20:31:40.6319090Z compiled: bool, 2025-05-07T20:31:40.6319303Z ) -> None: 2025-05-07T20:31:40.6319518Z torch.manual_seed(2025) 2025-05-07T20:31:40.6319756Z 2025-05-07T20:31:40.6320015Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6320354Z 2025-05-07T20:31:40.6320544Z x_sign = torch.sign(x) 2025-05-07T20:31:40.6320826Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.6322906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.6324844Z 2025-05-07T20:31:40.6324963Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:40.6325184Z 2025-05-07T20:31:40.6325284Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6325699Z self=, 2025-05-07T20:31:40.6326097Z T=128, 2025-05-07T20:31:40.6326282Z D=5120, 2025-05-07T20:31:40.6326471Z scale_ub=1200.0, 2025-05-07T20:31:40.6326685Z contiguous=True, 2025-05-07T20:31:40.6326899Z compiled=True, 2025-05-07T20:31:40.6327100Z ) 2025-05-07T20:31:40.6327415Z self = 2025-05-07T20:31:40.6327897Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:40.6328160Z 2025-05-07T20:31:40.6328243Z @given( 2025-05-07T20:31:40.6328462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.6328766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.6329064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.6329386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.6329703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.6329983Z ) 2025-05-07T20:31:40.6330324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.6330752Z def test_silu_mul_quant( 2025-05-07T20:31:40.6330989Z self, 2025-05-07T20:31:40.6331177Z T: int, 2025-05-07T20:31:40.6331362Z D: int, 2025-05-07T20:31:40.6331589Z scale_ub: Optional[float], 2025-05-07T20:31:40.6331864Z contiguous: bool, 2025-05-07T20:31:40.6332181Z compiled: bool, 2025-05-07T20:31:40.6332400Z ) -> None: 2025-05-07T20:31:40.6332614Z torch.manual_seed(2025) 2025-05-07T20:31:40.6332844Z 2025-05-07T20:31:40.6333109Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.6333445Z 2025-05-07T20:31:40.6333628Z > x_sign = torch.sign(x) 2025-05-07T20:31:40.6335559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:40.6337461Z 2025-05-07T20:31:40.6337581Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:40.6337797Z 2025-05-07T20:31:40.6337898Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.6338305Z self=, 2025-05-07T20:31:40.6338868Z T=128, 2025-05-07T20:31:40.6339049Z D=7168, 2025-05-07T20:31:40.6339234Z scale_ub=None, 2025-05-07T20:31:40.6339437Z contiguous=True, 2025-05-07T20:31:40.6339649Z compiled=True, 2025-05-07T20:31:40.6339843Z ) 2025-05-07T20:31:41.1541744Z self = 2025-05-07T20:31:41.1542264Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.1542533Z 2025-05-07T20:31:41.1542608Z @given( 2025-05-07T20:31:41.1542834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1543145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1543728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1544070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1544391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1544676Z ) 2025-05-07T20:31:41.1545022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1545454Z def test_silu_mul_quant( 2025-05-07T20:31:41.1545698Z self, 2025-05-07T20:31:41.1545901Z T: int, 2025-05-07T20:31:41.1546094Z D: int, 2025-05-07T20:31:41.1546302Z scale_ub: Optional[float], 2025-05-07T20:31:41.1546569Z contiguous: bool, 2025-05-07T20:31:41.1546807Z compiled: bool, 2025-05-07T20:31:41.1547029Z ) -> None: 2025-05-07T20:31:41.1547240Z torch.manual_seed(2025) 2025-05-07T20:31:41.1547479Z 2025-05-07T20:31:41.1547744Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1549833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1551679Z 2025-05-07T20:31:41.1551793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.1552009Z 2025-05-07T20:31:41.1608834Z FAILED 2025-05-07T20:31:41.1609079Z 2025-05-07T20:31:41.1609362Z =================================== FAILURES =================================== 2025-05-07T20:31:41.1609822Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:41.1610274Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:41.1611087Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:41.1611728Z | yield 2025-05-07T20:31:41.1612168Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:41.1612686Z | self._callTestMethod(testMethod) 2025-05-07T20:31:41.1613252Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:41.1613796Z | if method() is not None: 2025-05-07T20:31:41.1614041Z | ^^^^^^^^ 2025-05-07T20:31:41.1614684Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:41.1615409Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1615708Z | ^^^^^^^ 2025-05-07T20:31:41.1616270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:41.1616896Z | raise the_error_hypothesis_found 2025-05-07T20:31:41.1617329Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:41.1617753Z +-+---------------- 1 ---------------- 2025-05-07T20:31:41.1618038Z | Traceback (most recent call last): 2025-05-07T20:31:41.1618750Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1619533Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1619900Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1621976Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1624183Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1624623Z | self=, 2025-05-07T20:31:41.1625028Z | T=128, 2025-05-07T20:31:41.1625251Z | D=7168, 2025-05-07T20:31:41.1625473Z | scale_ub=1200.0, 2025-05-07T20:31:41.1625733Z | contiguous=True, 2025-05-07T20:31:41.1625979Z | compiled=False, 2025-05-07T20:31:41.1626200Z | ) 2025-05-07T20:31:41.1626380Z | 2025-05-07T20:31:41.1626913Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:41.1627511Z +---------------- 2 ---------------- 2025-05-07T20:31:41.1627797Z | Traceback (most recent call last): 2025-05-07T20:31:41.1628504Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1629274Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1629649Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1631623Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1633664Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1634098Z | self=, 2025-05-07T20:31:41.1634500Z | T=128, 2025-05-07T20:31:41.1634699Z | D=7168, 2025-05-07T20:31:41.1634912Z | scale_ub=None, 2025-05-07T20:31:41.1635137Z | contiguous=True, 2025-05-07T20:31:41.1635372Z | compiled=True, 2025-05-07T20:31:41.1635589Z | ) 2025-05-07T20:31:41.1635758Z | 2025-05-07T20:31:41.1636274Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1636868Z +---------------- 3 ---------------- 2025-05-07T20:31:41.1637154Z | Traceback (most recent call last): 2025-05-07T20:31:41.1637866Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:41.1638793Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1639165Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1641136Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
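Each falsifying example above comes with a @reproduce_failure blob. Replaying one locally means pinning the decorator printed in the log on top of an otherwise unchanged test; a sketch with a stand-in for the module's private constant (the strategies are copied from the log, the blob is the one printed for sub-exception 1, and the installed Hypothesis must match version 6.131.14 for the payload to decode):

from typing import Optional

from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

_MAX_SAMPLES = 100  # stand-in; the real constant lives in activation_test.py


class ReplayActivationTests:  # illustrative shell around the real test
    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")  # blob from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        pass  # the real body is in moe/activation_test.py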
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.1643216Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1643759Z | self=, 2025-05-07T20:31:41.1644162Z | T=128, 2025-05-07T20:31:41.1644361Z | D=5120, 2025-05-07T20:31:41.1644566Z | scale_ub=1200.0, 2025-05-07T20:31:41.1644803Z | contiguous=True, 2025-05-07T20:31:41.1645036Z | compiled=True, 2025-05-07T20:31:41.1645249Z | ) 2025-05-07T20:31:41.1645423Z | 2025-05-07T20:31:41.1645936Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1646527Z +---------------- 4 ---------------- 2025-05-07T20:31:41.1646856Z | Traceback (most recent call last): 2025-05-07T20:31:41.1647603Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:41.1648324Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.1648616Z | ^^^^^^^^ 2025-05-07T20:31:41.1649251Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:41.1649959Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1650308Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1651148Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:41.1651940Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.1652553Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:41.1653305Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1653900Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1654545Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:41.1655333Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1655811Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1656476Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:41.1657287Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1657750Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1658397Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:41.1659108Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.1659474Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1660081Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:41.1660648Z | fn() 2025-05-07T20:31:41.1661212Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:41.1661874Z | self.fn.run( 2025-05-07T20:31:41.1662402Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:41.1681483Z | kernel = self.compile( 2025-05-07T20:31:41.1685318Z | ^^^^^^^^^^^^^ 2025-05-07T20:31:41.1686492Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:41.1687677Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1688261Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1689170Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:41.1690285Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1690955Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:41.1691476Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1691962Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.1692330Z | ^ 2025-05-07T20:31:41.1692971Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.1693772Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:41.1694351Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:41.1695086Z | self=, 2025-05-07T20:31:41.1695689Z | T=1, # or any other generated value 2025-05-07T20:31:41.1696128Z | D=5120, # or any other generated value 2025-05-07T20:31:41.1696608Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:41.1697103Z | contiguous=True, # or any other generated value 2025-05-07T20:31:41.1697608Z | compiled=True, # or any other generated value 2025-05-07T20:31:41.1698029Z | ) 2025-05-07T20:31:41.1698270Z | 2025-05-07T20:31:41.1699005Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:41.1699975Z +------------------------------------ 2025-05-07T20:31:41.1700481Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:41.1700991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.1701569Z self=, 2025-05-07T20:31:41.1702128Z T=1, 2025-05-07T20:31:41.1702375Z D=5120, 2025-05-07T20:31:41.1702641Z scale_ub=None, 2025-05-07T20:31:41.1702939Z contiguous=True, 2025-05-07T20:31:41.1703248Z compiled=True, 2025-05-07T20:31:41.1703539Z ) 2025-05-07T20:31:41.1703984Z self = 2025-05-07T20:31:41.1704656Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.1705018Z 2025-05-07T20:31:41.1705126Z @given( 2025-05-07T20:31:41.1705448Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1705904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1706342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1706811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1707296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1707703Z ) 2025-05-07T20:31:41.1708203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1708823Z def test_silu_mul_quant( 2025-05-07T20:31:41.1709157Z self, 2025-05-07T20:31:41.1709439Z T: int, 2025-05-07T20:31:41.1709726Z D: int, 2025-05-07T20:31:41.1710029Z scale_ub: Optional[float], 2025-05-07T20:31:41.1710401Z contiguous: 
bool, 2025-05-07T20:31:41.1710745Z compiled: bool, 2025-05-07T20:31:41.1711073Z ) -> None: 2025-05-07T20:31:41.1711382Z torch.manual_seed(2025) 2025-05-07T20:31:41.1711735Z 2025-05-07T20:31:41.1712116Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1712690Z 2025-05-07T20:31:41.1712965Z x_sign = torch.sign(x) 2025-05-07T20:31:41.1713381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.1713806Z x = x_sign * x_clamp 2025-05-07T20:31:41.1714144Z x0 = x[:, :D] 2025-05-07T20:31:41.1714450Z x1 = x[:, D:] 2025-05-07T20:31:41.1714750Z 2025-05-07T20:31:41.1715020Z if contiguous: 2025-05-07T20:31:41.1715347Z x0 = x0.contiguous() 2025-05-07T20:31:41.1715711Z x1 = x1.contiguous() 2025-05-07T20:31:41.1716066Z 2025-05-07T20:31:41.1716340Z if scale_ub is not None: 2025-05-07T20:31:41.1716718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.1717241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.1717675Z ) 2025-05-07T20:31:41.1717939Z else: 2025-05-07T20:31:41.1718222Z scale_ub_tensor = None 2025-05-07T20:31:41.1718590Z 2025-05-07T20:31:41.1718927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1719380Z op = silu_mul_quant 2025-05-07T20:31:41.1719740Z if compiled: 2025-05-07T20:31:41.1720091Z op = torch.compile(op) 2025-05-07T20:31:41.1720511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1720891Z 2025-05-07T20:31:41.1721156Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.1721560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.1721969Z 2025-05-07T20:31:41.1722292Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1722749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.1723159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.1723756Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.1724247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1724679Z 2025-05-07T20:31:41.1724965Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.1725339Z 2025-05-07T20:31:41.1725482Z moe/activation_test.py:126: 2025-05-07T20:31:41.1725895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1726359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.1726804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.1727895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.1728943Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.1729690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.1730632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1731587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.1732597Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1733585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.1734650Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.1735627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.1736543Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.1737385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.1738074Z fn() 2025-05-07T20:31:41.1739853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.1740659Z self.fn.run( 2025-05-07T20:31:41.1741530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.1742294Z kernel = self.compile( 2025-05-07T20:31:41.1743026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.1743902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1744418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1744734Z 2025-05-07T20:31:41.1745007Z self = 2025-05-07T20:31:41.1746459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.1748335Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369c3060>} 2025-05-07T20:31:41.1750142Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.1751510Z context = 2025-05-07T20:31:41.1751894Z 2025-05-07T20:31:41.1752113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.1752816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1753427Z module_map=module_map) 2025-05-07T20:31:41.1753858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1754287Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.1754609Z E ^ 2025-05-07T20:31:41.1755177Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.1758842Z 2025-05-07T20:31:41.1759364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.1760012Z 2025-05-07T20:31:41.1760136Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.1760637Z self=, 2025-05-07T20:31:41.1761125Z T=2048, 2025-05-07T20:31:41.1761351Z D=5120, 2025-05-07T20:31:41.1761581Z scale_ub=1200.0, 2025-05-07T20:31:41.1761841Z contiguous=True, 2025-05-07T20:31:41.1762112Z compiled=False, 2025-05-07T20:31:41.1762363Z ) 2025-05-07T20:31:41.1762744Z self = 2025-05-07T20:31:41.1763527Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.1763869Z 2025-05-07T20:31:41.1763969Z @given( 2025-05-07T20:31:41.1764251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.1764680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.1765100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.1765538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.1765950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.1766295Z ) 2025-05-07T20:31:41.1766722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.1767268Z def test_silu_mul_quant( 2025-05-07T20:31:41.1767565Z self, 2025-05-07T20:31:41.1767795Z T: int, 2025-05-07T20:31:41.1768021Z D: int, 2025-05-07T20:31:41.1768283Z scale_ub: Optional[float], 2025-05-07T20:31:41.1768618Z contiguous: bool, 2025-05-07T20:31:41.1768916Z compiled: bool, 2025-05-07T20:31:41.1769187Z ) -> None: 2025-05-07T20:31:41.1769450Z torch.manual_seed(2025) 2025-05-07T20:31:41.1769839Z 2025-05-07T20:31:41.1770173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.1770590Z 2025-05-07T20:31:41.1770826Z x_sign = torch.sign(x) 2025-05-07T20:31:41.1771174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.1771551Z x = x_sign * x_clamp 2025-05-07T20:31:41.1771839Z x0 = x[:, :D] 2025-05-07T20:31:41.1772110Z x1 = x[:, D:] 2025-05-07T20:31:41.1772382Z 2025-05-07T20:31:41.1772611Z if contiguous: 2025-05-07T20:31:41.1772889Z x0 = x0.contiguous() 2025-05-07T20:31:41.1773201Z x1 = x1.contiguous() 2025-05-07T20:31:41.1773499Z 2025-05-07T20:31:41.1773721Z if scale_ub is not None: 2025-05-07T20:31:41.1774075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.1774500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.1774869Z ) 2025-05-07T20:31:41.1775099Z else: 2025-05-07T20:31:41.1775369Z scale_ub_tensor = None 2025-05-07T20:31:41.1775692Z 2025-05-07T20:31:41.1775973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.1776357Z op = silu_mul_quant 2025-05-07T20:31:41.1776659Z if compiled: 2025-05-07T20:31:41.1776948Z op = torch.compile(op) 2025-05-07T20:31:41.1777307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1777651Z 2025-05-07T20:31:41.1777883Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.1778093Z 2025-05-07T20:31:41.1778223Z moe/activation_test.py:117: 2025-05-07T20:31:41.1778632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1779090Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.1779480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.1780444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.1781470Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.1782191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.1783121Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.1784033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.1784765Z kernel = self.compile( 2025-05-07T20:31:41.1785520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.1786457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.1787013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.1787336Z 2025-05-07T20:31:41.1787612Z self = 2025-05-07T20:31:41.1789107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.1791016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09369deac0>} 2025-05-07T20:31:41.1792900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.1794331Z context = 2025-05-07T20:31:41.1794720Z 2025-05-07T20:31:41.1794938Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.1795758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.1796407Z module_map=module_map) 2025-05-07T20:31:41.1796909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.1797411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.1797779Z E ^ 2025-05-07T20:31:41.1798410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[test body identical to the example above; duplicate source elided. With compiled=True the call y_fp8, y_scale = fn() succeeds and the failure moves to the reference path:]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
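All of the failures above share one root cause: Triton's NVIDIA backend lowers torch.float8_e4m3fn to its fp8e4nv dtype, which it only accepts on GPUs of compute capability 8.9 and newer (Ada/Hopper), while the A10G in this linux.g5.4xlarge runner reports capability 8.6, where only fp8e4b15 and fp8e5 exist; hence the ValueError. Below is a minimal sketch of a capability guard that would let the suite skip cleanly instead of erroring; the helper name and the skip wiring are illustrative assumptions, not code from activation_test.py:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    # Hypothetical helper: Triton lowers torch.float8_e4m3fn to fp8e4nv,
    # which its NVIDIA backend only accepts on compute capability >= 8.9
    # (e.g. L4, L40S, H100). The A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Skip the fp8 activation tests on unsupported GPUs instead of erroring out.
@unittest.skipIf(not fp8e4nv_supported(), "Triton fp8e4nv requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):
    ...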
[The remaining Hypothesis examples failed identically; duplicate test bodies and tracebacks elided. Each run ended in triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the sampled parameters and the first kernel to hit the error differ:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (via ref_fn)
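The pattern in the condensed list is mechanical: with compiled=False the fused _fbgemm_silu_mul_quant kernel compiles the fp8e4nv cast first and fails there, while with compiled=True the forward call gets through and the same cast is first compiled in the reference path's _kernel_quantize_fp8_row. In torch terms, Triton's fp8e4nv is torch.float8_e4m3fn and fp8e5 is torch.float8_e5m2, so a device-aware dtype choice is possible in principle. A rough sketch under that assumption; pick_fp8_dtype is a hypothetical name, not FBGEMM's actual API:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Hypothetical fallback, not FBGEMM's real logic: prefer e4m3
    # (Triton fp8e4nv, compute capability >= 8.9) and fall back to e5m2
    # (Triton fp8e5), which SM 8.x parts such as the A10G still accept,
    # at the cost of one fewer mantissa bit of precision.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2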
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2026829Z 2025-05-07T20:31:41.2027284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2027290Z 2025-05-07T20:31:41.2027400Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2027626Z self=, 2025-05-07T20:31:41.2027702Z T=4096, 2025-05-07T20:31:41.2027775Z D=5120, 2025-05-07T20:31:41.2027860Z scale_ub=1200.0, 2025-05-07T20:31:41.2027943Z contiguous=True, 2025-05-07T20:31:41.2028023Z compiled=False, 2025-05-07T20:31:41.2028101Z ) 2025-05-07T20:31:41.2028316Z self = 2025-05-07T20:31:41.2028496Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2028501Z 2025-05-07T20:31:41.2028589Z @given( 2025-05-07T20:31:41.2028710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2028813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2028926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2029039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2029159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2029233Z ) 2025-05-07T20:31:41.2029483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2029576Z def test_silu_mul_quant( 2025-05-07T20:31:41.2029652Z self, 2025-05-07T20:31:41.2029731Z T: int, 2025-05-07T20:31:41.2029805Z D: int, 2025-05-07T20:31:41.2029903Z scale_ub: Optional[float], 2025-05-07T20:31:41.2029999Z contiguous: bool, 2025-05-07T20:31:41.2030081Z compiled: bool, 2025-05-07T20:31:41.2030157Z ) -> None: 2025-05-07T20:31:41.2030337Z torch.manual_seed(2025) 2025-05-07T20:31:41.2030416Z 2025-05-07T20:31:41.2030586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2037256Z 2025-05-07T20:31:41.2037371Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2037504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2037603Z x = x_sign * x_clamp 2025-05-07T20:31:41.2037683Z x0 = x[:, :D] 2025-05-07T20:31:41.2037764Z x1 = x[:, D:] 2025-05-07T20:31:41.2037845Z 2025-05-07T20:31:41.2037931Z if contiguous: 2025-05-07T20:31:41.2038024Z x0 = x0.contiguous() 2025-05-07T20:31:41.2038118Z x1 = x1.contiguous() 2025-05-07T20:31:41.2038191Z 2025-05-07T20:31:41.2038288Z if scale_ub is not None: 2025-05-07T20:31:41.2038648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2038845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2038932Z ) 2025-05-07T20:31:41.2039022Z else: 2025-05-07T20:31:41.2039120Z scale_ub_tensor = None 2025-05-07T20:31:41.2039200Z 2025-05-07T20:31:41.2039337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2039426Z op = silu_mul_quant 2025-05-07T20:31:41.2039517Z if compiled: 2025-05-07T20:31:41.2039617Z op = torch.compile(op) 2025-05-07T20:31:41.2039719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2039795Z 2025-05-07T20:31:41.2039885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2039891Z 2025-05-07T20:31:41.2039994Z moe/activation_test.py:117: 2025-05-07T20:31:41.2040124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2040227Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2040332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2040842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2041200Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2041573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2041794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2042142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2042234Z kernel = self.compile( 2025-05-07T20:31:41.2042619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2042797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2042923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2042928Z 2025-05-07T20:31:41.2043146Z self = 2025-05-07T20:31:41.2044067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2044567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090e720720>} 2025-05-07T20:31:41.2045322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2045509Z context = 2025-05-07T20:31:41.2045514Z 2025-05-07T20:31:41.2045685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2046064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2046176Z module_map=module_map) 2025-05-07T20:31:41.2046343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2046439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2046523Z E ^ 2025-05-07T20:31:41.2046876Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2046881Z 
2025-05-07T20:31:41.2047293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2047298Z 
2025-05-07T20:31:41.2047404Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:41.2047624Z     self=,
2025-05-07T20:31:41.2047701Z     T=1,
2025-05-07T20:31:41.2047783Z     D=5120,
2025-05-07T20:31:41.2047862Z     scale_ub=None,
2025-05-07T20:31:41.2047954Z     contiguous=True,
2025-05-07T20:31:41.2048040Z     compiled=True,
2025-05-07T20:31:41.2048114Z )
2025-05-07T20:31:41.2048338Z self = 
2025-05-07T20:31:41.2048497Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:41.2048502Z 
2025-05-07T20:31:41.2048574Z     @given(
2025-05-07T20:31:41.2048696Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:41.2048791Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:41.2048901Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:41.2049017Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:41.2049126Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:41.2049202Z     )
2025-05-07T20:31:41.2049444Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:41.2049535Z     def test_silu_mul_quant(
2025-05-07T20:31:41.2049614Z         self,
2025-05-07T20:31:41.2049774Z         T: int,
2025-05-07T20:31:41.2049847Z         D: int,
2025-05-07T20:31:41.2049946Z         scale_ub: Optional[float],
2025-05-07T20:31:41.2050033Z         contiguous: bool,
2025-05-07T20:31:41.2050116Z         compiled: bool,
2025-05-07T20:31:41.2050198Z     ) -> None:
2025-05-07T20:31:41.2050291Z         torch.manual_seed(2025)
2025-05-07T20:31:41.2050362Z 
2025-05-07T20:31:41.2050536Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:41.2050609Z 
2025-05-07T20:31:41.2050704Z         x_sign = torch.sign(x)
2025-05-07T20:31:41.2050831Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2050914Z         x = x_sign * x_clamp
2025-05-07T20:31:41.2050997Z         x0 = x[:, :D]
2025-05-07T20:31:41.2051074Z         x1 = x[:, D:]
2025-05-07T20:31:41.2051145Z 
2025-05-07T20:31:41.2051229Z         if contiguous:
2025-05-07T20:31:41.2051318Z             x0 = x0.contiguous()
2025-05-07T20:31:41.2051409Z             x1 = x1.contiguous()
2025-05-07T20:31:41.2051492Z 
2025-05-07T20:31:41.2051578Z         if scale_ub is not None:
2025-05-07T20:31:41.2051682Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:41.2051820Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:41.2051894Z             )
2025-05-07T20:31:41.2051968Z         else:
2025-05-07T20:31:41.2052063Z             scale_ub_tensor = None
2025-05-07T20:31:41.2052135Z 
2025-05-07T20:31:41.2052273Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2052361Z             op = silu_mul_quant
2025-05-07T20:31:41.2052443Z             if compiled:
2025-05-07T20:31:41.2052548Z                 op = torch.compile(op)
2025-05-07T20:31:41.2052650Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:41.2052723Z 
2025-05-07T20:31:41.2052821Z         y_fp8, y_scale = fn()
2025-05-07T20:31:41.2052938Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:41.2053008Z 
2025-05-07T20:31:41.2053237Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2053339Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:41.2053435Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:41.2053560Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:41.2053696Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2053778Z 
2025-05-07T20:31:41.2053874Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:41.2053879Z 
2025-05-07T20:31:41.2053973Z moe/activation_test.py:126: 
2025-05-07T20:31:41.2054106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2054209Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:41.2054340Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2054910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:41.2055015Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:41.2055383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:41.2055604Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:41.2055972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:41.2056235Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2056632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:41.2056894Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2057274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:41.2057517Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:41.2057865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:41.2057941Z     fn()
2025-05-07T20:31:41.2058341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:41.2058429Z     self.fn.run(
2025-05-07T20:31:41.2058769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:41.2058858Z     kernel = self.compile(
2025-05-07T20:31:41.2059239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:41.2059418Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:41.2059543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2059553Z 
2025-05-07T20:31:41.2059760Z self = 
2025-05-07T20:31:41.2060537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:41.2061032Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924366660>}
2025-05-07T20:31:41.2061785Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:41.2061972Z context = 
2025-05-07T20:31:41.2061976Z 
2025-05-07T20:31:41.2062216Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:41.2062489Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:41.2062594Z                           module_map=module_map)
2025-05-07T20:31:41.2062756Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2062856Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:41.2062931Z E       ^
2025-05-07T20:31:41.2063289Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2063294Z 
2025-05-07T20:31:41.2063705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2063710Z 
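Every example in this run fails at the same spot, before the kernel ever executes: fp8e4nv is Triton's name for torch.float8_e4m3fn, and the Triton build in this environment only lowers that encoding on GPUs with CUDA compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an NVIDIA A10G, which reports capability (8, 6) and hence offers only fp8e4b15 and fp8e5, so ast_to_ttir rejects both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant at their first line. A minimal sketch of the kind of capability guard that keeps such tests off pre-SM89 hardware (the helper name and test body here are illustrative, not taken from FBGEMM):

import unittest

import torch


def supports_fp8_e4m3() -> bool:
    # fp8e4nv lowering needs SM 8.9+ (e.g. L4, L40S, H100); the A10G in
    # this job reports (8, 6), so the guard evaluates to False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    def test_fp8_cast(self) -> None:
        x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
        self.assertEqual(x.to(torch.float8_e4m3fn).dtype, torch.float8_e4m3fn)

Note that an eager-mode cast to torch.float8_e4m3fn can still succeed on the A10G; it is specifically the Triton lowering that rejects the dtype, which is why the failure surfaces as a CompilationError at kernel-compile time rather than as a runtime numerical error.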
2025-05-07T20:31:41.2068931Z op = torch.compile(op) 2025-05-07T20:31:41.2069038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2069107Z 2025-05-07T20:31:41.2069198Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2069400Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2069470Z 2025-05-07T20:31:41.2069610Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2069707Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2069803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2069929Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2070066Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2070136Z 2025-05-07T20:31:41.2070240Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2070245Z 2025-05-07T20:31:41.2070340Z moe/activation_test.py:126: 2025-05-07T20:31:41.2070470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2070570Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2070699Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2071267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2071370Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2071730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2071957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2072323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2072580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2072976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2073228Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2073611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2073855Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2074199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2074272Z fn() 2025-05-07T20:31:41.2074671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2074754Z self.fn.run( 2025-05-07T20:31:41.2075092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2075182Z kernel = self.compile( 2025-05-07T20:31:41.2075567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2075737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2075878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2075883Z 2025-05-07T20:31:41.2076085Z self = 2025-05-07T20:31:41.2076858Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:41.2077354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f092455a5c0>} 2025-05-07T20:31:41.2078100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2078370Z context = 2025-05-07T20:31:41.2078385Z 2025-05-07T20:31:41.2078550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2078811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2078920Z module_map=module_map) 2025-05-07T20:31:41.2079079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2079182Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2079257Z E ^ 2025-05-07T20:31:41.2079606Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2079611Z 2025-05-07T20:31:41.2080028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2080033Z 2025-05-07T20:31:41.2080133Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2080364Z self=, 2025-05-07T20:31:41.2080443Z T=128, 2025-05-07T20:31:41.2080517Z D=5120, 2025-05-07T20:31:41.2080601Z scale_ub=None, 2025-05-07T20:31:41.2080683Z contiguous=True, 2025-05-07T20:31:41.2080757Z compiled=True, 2025-05-07T20:31:41.2080831Z ) 2025-05-07T20:31:41.2081047Z self = 2025-05-07T20:31:41.2081212Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2081217Z 2025-05-07T20:31:41.2081298Z @given( 2025-05-07T20:31:41.2081413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2081516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2081628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2081742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2081855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2081931Z ) 2025-05-07T20:31:41.2082256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2082347Z def test_silu_mul_quant( 2025-05-07T20:31:41.2082421Z self, 2025-05-07T20:31:41.2082494Z T: int, 2025-05-07T20:31:41.2082572Z D: int, 2025-05-07T20:31:41.2082665Z scale_ub: Optional[float], 2025-05-07T20:31:41.2082754Z contiguous: bool, 2025-05-07T20:31:41.2082839Z compiled: bool, 2025-05-07T20:31:41.2082912Z ) -> None: 2025-05-07T20:31:41.2083006Z torch.manual_seed(2025) 2025-05-07T20:31:41.2083076Z 2025-05-07T20:31:41.2083319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2083396Z 2025-05-07T20:31:41.2083484Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2083607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2083697Z x = x_sign * x_clamp 2025-05-07T20:31:41.2083772Z x0 = x[:, :D] 2025-05-07T20:31:41.2083857Z x1 = x[:, D:] 2025-05-07T20:31:41.2083934Z 2025-05-07T20:31:41.2084014Z if contiguous: 2025-05-07T20:31:41.2084101Z x0 = x0.contiguous() 2025-05-07T20:31:41.2084189Z x1 = x1.contiguous() 2025-05-07T20:31:41.2084259Z 2025-05-07T20:31:41.2084349Z if scale_ub is not None: 2025-05-07T20:31:41.2084450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2084582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2084663Z ) 2025-05-07T20:31:41.2084737Z else: 2025-05-07T20:31:41.2084829Z scale_ub_tensor = None 2025-05-07T20:31:41.2084900Z 2025-05-07T20:31:41.2085028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:41.2085115Z op = silu_mul_quant 2025-05-07T20:31:41.2085203Z if compiled: 2025-05-07T20:31:41.2085298Z op = torch.compile(op) 2025-05-07T20:31:41.2085484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2085561Z 2025-05-07T20:31:41.2085650Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2085773Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2085843Z 2025-05-07T20:31:41.2085973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2086082Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2086177Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2086293Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2086431Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2086500Z 2025-05-07T20:31:41.2086594Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2086599Z 2025-05-07T20:31:41.2086700Z moe/activation_test.py:126: 2025-05-07T20:31:41.2086823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2086926Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2087064Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2087626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2087729Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2088088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2088308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2088679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2088932Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2089332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2089589Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2090043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2090207Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2090548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2090626Z fn() 2025-05-07T20:31:41.2091024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2091101Z self.fn.run( 2025-05-07T20:31:41.2091442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2091531Z kernel = self.compile( 2025-05-07T20:31:41.2091913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2092093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2092216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2092221Z 2025-05-07T20:31:41.2092427Z self = 2025-05-07T20:31:41.2093198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2093693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0924558400>} 2025-05-07T20:31:41.2094545Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2094736Z context = 2025-05-07T20:31:41.2094741Z 2025-05-07T20:31:41.2094907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2095169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2095277Z module_map=module_map) 2025-05-07T20:31:41.2095433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2095530Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2095609Z E ^ 2025-05-07T20:31:41.2095961Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2095965Z 2025-05-07T20:31:41.2096374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2096383Z 2025-05-07T20:31:41.2096492Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2096711Z self=, 2025-05-07T20:31:41.2096790Z T=4096, 2025-05-07T20:31:41.2096864Z D=5120, 2025-05-07T20:31:41.2096941Z scale_ub=None, 2025-05-07T20:31:41.2097027Z contiguous=True, 2025-05-07T20:31:41.2097106Z compiled=True, 2025-05-07T20:31:41.2097177Z ) 2025-05-07T20:31:41.2097392Z self = 2025-05-07T20:31:41.2097564Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2097569Z 2025-05-07T20:31:41.2097644Z @given( 2025-05-07T20:31:41.2097763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2097855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2097965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2098085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2098275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2098346Z ) 2025-05-07T20:31:41.2098590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2098679Z def test_silu_mul_quant( 2025-05-07T20:31:41.2098755Z self, 2025-05-07T20:31:41.2098828Z T: int, 2025-05-07T20:31:41.2098903Z D: int, 2025-05-07T20:31:41.2098998Z scale_ub: Optional[float], 2025-05-07T20:31:41.2099082Z contiguous: bool, 2025-05-07T20:31:41.2099163Z compiled: bool, 2025-05-07T20:31:41.2099241Z ) -> None: 2025-05-07T20:31:41.2099333Z torch.manual_seed(2025) 2025-05-07T20:31:41.2099398Z 2025-05-07T20:31:41.2099565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2099635Z 2025-05-07T20:31:41.2099721Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2099846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2099943Z x = x_sign * x_clamp 2025-05-07T20:31:41.2100020Z x0 = x[:, :D] 2025-05-07T20:31:41.2100100Z x1 = x[:, D:] 2025-05-07T20:31:41.2100169Z 2025-05-07T20:31:41.2100251Z if contiguous: 2025-05-07T20:31:41.2100337Z x0 = x0.contiguous() 2025-05-07T20:31:41.2100420Z x1 = x1.contiguous() 2025-05-07T20:31:41.2100496Z 2025-05-07T20:31:41.2100584Z if scale_ub is not None: 2025-05-07T20:31:41.2100687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2100820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2100891Z ) 2025-05-07T20:31:41.2100962Z else: 2025-05-07T20:31:41.2101056Z scale_ub_tensor 
= None 2025-05-07T20:31:41.2101126Z 2025-05-07T20:31:41.2101250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2101345Z op = silu_mul_quant 2025-05-07T20:31:41.2101426Z if compiled: 2025-05-07T20:31:41.2101603Z op = torch.compile(op) 2025-05-07T20:31:41.2101709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2101777Z 2025-05-07T20:31:41.2101866Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2101982Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2102052Z 2025-05-07T20:31:41.2102184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2102282Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2102377Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2102501Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2102639Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2102711Z 2025-05-07T20:31:41.2102813Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2102817Z 2025-05-07T20:31:41.2102911Z moe/activation_test.py:126: 2025-05-07T20:31:41.2103043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2103145Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2103275Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2103834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2103932Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2104290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2104512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2104877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2105135Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2105536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2105866Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2106243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2106406Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2106747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2106821Z fn() 2025-05-07T20:31:41.2107218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2107300Z self.fn.run( 2025-05-07T20:31:41.2107636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2107724Z kernel = self.compile( 2025-05-07T20:31:41.2108113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2108290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2108419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2108423Z 2025-05-07T20:31:41.2108623Z self = 2025-05-07T20:31:41.2109391Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2109888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090db2e7a0>} 2025-05-07T20:31:41.2110713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2110910Z context = 2025-05-07T20:31:41.2110914Z 2025-05-07T20:31:41.2111073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2111336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2111438Z module_map=module_map) 2025-05-07T20:31:41.2111593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2111695Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2111769Z E ^ 2025-05-07T20:31:41.2112129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2112134Z 2025-05-07T20:31:41.2112555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2112563Z 2025-05-07T20:31:41.2112660Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2112882Z self=, 2025-05-07T20:31:41.2112956Z T=16384, 2025-05-07T20:31:41.2113030Z D=5120, 2025-05-07T20:31:41.2113112Z scale_ub=None, 2025-05-07T20:31:41.2113194Z contiguous=True, 2025-05-07T20:31:41.2113273Z compiled=True, 2025-05-07T20:31:41.2113345Z ) 2025-05-07T20:31:41.2113558Z self = 2025-05-07T20:31:41.2113724Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2113732Z 2025-05-07T20:31:41.2113806Z @given( 2025-05-07T20:31:41.2113919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2114018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2114133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2114327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2114439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2114510Z ) 2025-05-07T20:31:41.2114752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2114845Z def test_silu_mul_quant( 2025-05-07T20:31:41.2114917Z self, 2025-05-07T20:31:41.2114992Z T: int, 2025-05-07T20:31:41.2115069Z D: int, 2025-05-07T20:31:41.2115160Z scale_ub: Optional[float], 2025-05-07T20:31:41.2115250Z contiguous: bool, 2025-05-07T20:31:41.2115331Z compiled: bool, 2025-05-07T20:31:41.2115403Z ) -> None: 2025-05-07T20:31:41.2115494Z torch.manual_seed(2025) 2025-05-07T20:31:41.2115561Z 2025-05-07T20:31:41.2115725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2115800Z 2025-05-07T20:31:41.2115894Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2116020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2116106Z x = x_sign * x_clamp 2025-05-07T20:31:41.2116184Z x0 = x[:, :D] 2025-05-07T20:31:41.2116257Z x1 = x[:, D:] 2025-05-07T20:31:41.2116330Z 2025-05-07T20:31:41.2116406Z if contiguous: 2025-05-07T20:31:41.2116495Z x0 = x0.contiguous() 2025-05-07T20:31:41.2116581Z x1 = x1.contiguous() 2025-05-07T20:31:41.2116652Z 2025-05-07T20:31:41.2116738Z if scale_ub is not None: 2025-05-07T20:31:41.2116837Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2116970Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:41.2117043Z ) 2025-05-07T20:31:41.2117113Z else: 2025-05-07T20:31:41.2117202Z scale_ub_tensor = None 2025-05-07T20:31:41.2117277Z 2025-05-07T20:31:41.2117401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2117579Z op = silu_mul_quant 2025-05-07T20:31:41.2117664Z if compiled: 2025-05-07T20:31:41.2117760Z op = torch.compile(op) 2025-05-07T20:31:41.2117867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2117936Z 2025-05-07T20:31:41.2118021Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2118143Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2118213Z 2025-05-07T20:31:41.2118343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2118444Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2118538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2118657Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2118797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2118866Z 2025-05-07T20:31:41.2118963Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:41.2118970Z 2025-05-07T20:31:41.2119067Z moe/activation_test.py:126: 2025-05-07T20:31:41.2119193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2119303Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2119432Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2119990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2120093Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2120453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2120675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2121038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2121293Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2121776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2122027Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2122402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2122566Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2122907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2122984Z fn() 2025-05-07T20:31:41.2123456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2123533Z self.fn.run( 2025-05-07T20:31:41.2123876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2123971Z kernel = self.compile( 2025-05-07T20:31:41.2124354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2124528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2124653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2124658Z 2025-05-07T20:31:41.2124865Z self = 2025-05-07T20:31:41.2125634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2126239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d9de200>} 2025-05-07T20:31:41.2127012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2127198Z context = 2025-05-07T20:31:41.2127203Z 2025-05-07T20:31:41.2127364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2127623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2127728Z module_map=module_map) 2025-05-07T20:31:41.2127885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2127983Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2128059Z E ^ 2025-05-07T20:31:41.2128415Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2128423Z 2025-05-07T20:31:41.2128834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2128843Z 2025-05-07T20:31:41.2128940Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2129157Z self=, 2025-05-07T20:31:41.2129237Z T=1, 2025-05-07T20:31:41.2129312Z D=5120, 2025-05-07T20:31:41.2129391Z scale_ub=1200.0, 2025-05-07T20:31:41.2129478Z contiguous=True, 2025-05-07T20:31:41.2129561Z compiled=True, 2025-05-07T20:31:41.2129629Z ) 2025-05-07T20:31:41.2129848Z self = 2025-05-07T20:31:41.2130008Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2130013Z 2025-05-07T20:31:41.2130096Z @given( 2025-05-07T20:31:41.2130214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2130390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2130505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2130616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2130726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2130798Z ) 2025-05-07T20:31:41.2131038Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2131125Z def test_silu_mul_quant( 2025-05-07T20:31:41.2131200Z self, 2025-05-07T20:31:41.2131273Z T: int, 2025-05-07T20:31:41.2131343Z D: int, 2025-05-07T20:31:41.2131441Z scale_ub: Optional[float], 2025-05-07T20:31:41.2131526Z contiguous: bool, 2025-05-07T20:31:41.2131609Z compiled: bool, 2025-05-07T20:31:41.2131683Z ) -> None: 2025-05-07T20:31:41.2131774Z torch.manual_seed(2025) 2025-05-07T20:31:41.2131845Z 2025-05-07T20:31:41.2132017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2132092Z 2025-05-07T20:31:41.2132183Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2132302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2132385Z x = x_sign * x_clamp 2025-05-07T20:31:41.2132465Z x0 = x[:, :D] 2025-05-07T20:31:41.2132541Z x1 = x[:, D:] 2025-05-07T20:31:41.2132611Z 2025-05-07T20:31:41.2132694Z if contiguous: 2025-05-07T20:31:41.2132778Z x0 = x0.contiguous() 2025-05-07T20:31:41.2132872Z x1 = x1.contiguous() 2025-05-07T20:31:41.2132938Z 2025-05-07T20:31:41.2133024Z if scale_ub is not None: 2025-05-07T20:31:41.2133129Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:41.2133258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2133331Z ) 2025-05-07T20:31:41.2133409Z else: 2025-05-07T20:31:41.2133499Z scale_ub_tensor = None 2025-05-07T20:31:41.2133650Z 2025-05-07T20:31:41.2133788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2133872Z op = silu_mul_quant 2025-05-07T20:31:41.2133952Z if compiled: 2025-05-07T20:31:41.2134049Z op = torch.compile(op) 2025-05-07T20:31:41.2134152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2134228Z 2025-05-07T20:31:41.2134316Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2134320Z 2025-05-07T20:31:41.2134415Z moe/activation_test.py:117: 2025-05-07T20:31:41.2134542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2134638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2134732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2135101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2135187Z return fn(*args, **kwargs) 2025-05-07T20:31:41.2135690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2135792Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2136147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2136371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2136710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2136801Z kernel = self.compile( 2025-05-07T20:31:41.2137202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2137398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2137529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2137534Z 2025-05-07T20:31:41.2137825Z self = 2025-05-07T20:31:41.2138843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2139350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d451d00>} 2025-05-07T20:31:41.2140097Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2140292Z context = 2025-05-07T20:31:41.2140297Z 2025-05-07T20:31:41.2140465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2140729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2140836Z module_map=module_map) 2025-05-07T20:31:41.2140993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2141092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2141164Z E ^ 2025-05-07T20:31:41.2141516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2141521Z 2025-05-07T20:31:41.2141934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2141938Z 2025-05-07T20:31:41.2142036Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2142260Z self=, 2025-05-07T20:31:41.2142335Z T=1, 2025-05-07T20:31:41.2142543Z D=5120, 2025-05-07T20:31:41.2142632Z scale_ub=None, 2025-05-07T20:31:41.2142714Z contiguous=False, 2025-05-07T20:31:41.2142794Z compiled=True, 2025-05-07T20:31:41.2142867Z ) 2025-05-07T20:31:41.2143083Z self = 2025-05-07T20:31:41.2143245Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2143249Z 2025-05-07T20:31:41.2143329Z @given( 2025-05-07T20:31:41.2143443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2143540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2143650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2143760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2143872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2143940Z ) 2025-05-07T20:31:41.2144183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2144284Z def test_silu_mul_quant( 2025-05-07T20:31:41.2144362Z self, 2025-05-07T20:31:41.2144434Z T: int, 2025-05-07T20:31:41.2144505Z D: int, 2025-05-07T20:31:41.2144600Z scale_ub: Optional[float], 2025-05-07T20:31:41.2144682Z contiguous: bool, 2025-05-07T20:31:41.2144767Z compiled: bool, 2025-05-07T20:31:41.2144840Z ) -> None: 2025-05-07T20:31:41.2144933Z torch.manual_seed(2025) 2025-05-07T20:31:41.2145002Z 2025-05-07T20:31:41.2145169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2145240Z 2025-05-07T20:31:41.2145328Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2145449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2145536Z x = x_sign * x_clamp 2025-05-07T20:31:41.2145610Z x0 = x[:, :D] 2025-05-07T20:31:41.2145683Z x1 = x[:, D:] 2025-05-07T20:31:41.2145759Z 2025-05-07T20:31:41.2145837Z if contiguous: 2025-05-07T20:31:41.2146049Z x0 = x0.contiguous() 2025-05-07T20:31:41.2146137Z x1 = x1.contiguous() 2025-05-07T20:31:41.2146206Z 2025-05-07T20:31:41.2146292Z if scale_ub is not None: 2025-05-07T20:31:41.2146398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2146528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2146603Z ) 2025-05-07T20:31:41.2146676Z else: 2025-05-07T20:31:41.2146769Z scale_ub_tensor = None 2025-05-07T20:31:41.2146840Z 2025-05-07T20:31:41.2146965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2147051Z op = silu_mul_quant 2025-05-07T20:31:41.2147146Z if compiled: 2025-05-07T20:31:41.2147257Z op = torch.compile(op) 2025-05-07T20:31:41.2147374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2147454Z 2025-05-07T20:31:41.2147540Z y_fp8, y_scale = fn() 2025-05-07T20:31:41.2147659Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:41.2147735Z 2025-05-07T20:31:41.2147865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2147963Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:41.2148058Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:41.2148178Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:41.2148317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2148383Z 2025-05-07T20:31:41.2148479Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:41.2148483Z 2025-05-07T20:31:41.2148582Z moe/activation_test.py:126: 2025-05-07T20:31:41.2148704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2148807Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.2148937Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.2149577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.2149683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.2150045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2150266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2150639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.2150891Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2151297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.2151546Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.2151922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.2152093Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.2152432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.2152511Z fn() 2025-05-07T20:31:41.2152909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.2152985Z self.fn.run( 2025-05-07T20:31:41.2153326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2153414Z kernel = self.compile( 2025-05-07T20:31:41.2153794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2153968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2154096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2154201Z 2025-05-07T20:31:41.2154410Z self = 2025-05-07T20:31:41.2155180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2155671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f090d451260>} 2025-05-07T20:31:41.2156423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2156615Z context = 2025-05-07T20:31:41.2156625Z 2025-05-07T20:31:41.2156792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2157052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2157152Z module_map=module_map) 2025-05-07T20:31:41.2157312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2157411Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:41.2157488Z E ^ 2025-05-07T20:31:41.2157842Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2157847Z 2025-05-07T20:31:41.2158257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2158262Z 2025-05-07T20:31:41.2158362Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2158657Z self=, 2025-05-07T20:31:41.2158741Z T=1, 2025-05-07T20:31:41.2158813Z D=5120, 2025-05-07T20:31:41.2158891Z scale_ub=None, 2025-05-07T20:31:41.2158973Z contiguous=True, 2025-05-07T20:31:41.2159054Z compiled=False, 2025-05-07T20:31:41.2159122Z ) 2025-05-07T20:31:41.2159338Z self = 2025-05-07T20:31:41.2159497Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2159502Z 2025-05-07T20:31:41.2159576Z @given( 2025-05-07T20:31:41.2159694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2159789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2159903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2160012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2160122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2160201Z ) 2025-05-07T20:31:41.2165696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2165814Z def test_silu_mul_quant( 2025-05-07T20:31:41.2165893Z self, 2025-05-07T20:31:41.2165969Z T: int, 2025-05-07T20:31:41.2166040Z D: int, 2025-05-07T20:31:41.2166145Z scale_ub: Optional[float], 2025-05-07T20:31:41.2166233Z contiguous: bool, 2025-05-07T20:31:41.2166315Z compiled: bool, 2025-05-07T20:31:41.2166398Z ) -> None: 2025-05-07T20:31:41.2166489Z torch.manual_seed(2025) 2025-05-07T20:31:41.2166566Z 2025-05-07T20:31:41.2166738Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2166811Z 2025-05-07T20:31:41.2166902Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2167026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2167111Z x = x_sign * x_clamp 2025-05-07T20:31:41.2167193Z x0 = x[:, :D] 2025-05-07T20:31:41.2167271Z x1 = x[:, D:] 2025-05-07T20:31:41.2167450Z 2025-05-07T20:31:41.2167540Z if contiguous: 2025-05-07T20:31:41.2167631Z x0 = x0.contiguous() 2025-05-07T20:31:41.2167716Z x1 = x1.contiguous() 2025-05-07T20:31:41.2167792Z 2025-05-07T20:31:41.2167876Z if scale_ub is not None: 2025-05-07T20:31:41.2167980Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2168110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2168183Z ) 2025-05-07T20:31:41.2168263Z else: 2025-05-07T20:31:41.2168352Z scale_ub_tensor = None 2025-05-07T20:31:41.2168422Z 2025-05-07T20:31:41.2168553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2168640Z op = silu_mul_quant 2025-05-07T20:31:41.2168726Z if compiled: 2025-05-07T20:31:41.2168825Z 
op = torch.compile(op) 2025-05-07T20:31:41.2168927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2169009Z 2025-05-07T20:31:41.2169102Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2169107Z 2025-05-07T20:31:41.2169202Z moe/activation_test.py:117: 2025-05-07T20:31:41.2169338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2169438Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2169537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2170045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2170142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2170502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2170729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2171234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2171333Z kernel = self.compile( 2025-05-07T20:31:41.2171717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2171889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2172022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2172027Z 2025-05-07T20:31:41.2172227Z self = 2025-05-07T20:31:41.2173001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2173502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decff60>} 2025-05-07T20:31:41.2174252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2174442Z context = 2025-05-07T20:31:41.2174447Z 2025-05-07T20:31:41.2174609Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2174873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2174977Z module_map=module_map) 2025-05-07T20:31:41.2175137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2175236Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2175309Z E ^ 2025-05-07T20:31:41.2175668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2175755Z 2025-05-07T20:31:41.2176171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2176176Z 2025-05-07T20:31:41.2176275Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2176501Z self=, 2025-05-07T20:31:41.2176576Z T=128, 2025-05-07T20:31:41.2176645Z D=5120, 2025-05-07T20:31:41.2176725Z scale_ub=None, 2025-05-07T20:31:41.2176807Z contiguous=False, 2025-05-07T20:31:41.2176891Z compiled=True, 2025-05-07T20:31:41.2176961Z ) 2025-05-07T20:31:41.2177176Z self = 2025-05-07T20:31:41.2177348Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2177353Z 2025-05-07T20:31:41.2177427Z @given( 2025-05-07T20:31:41.2177543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2177643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2177762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2177874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2177987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2178060Z ) 2025-05-07T20:31:41.2178304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2178392Z def test_silu_mul_quant( 2025-05-07T20:31:41.2178467Z self, 2025-05-07T20:31:41.2178547Z T: int, 2025-05-07T20:31:41.2178622Z D: int, 2025-05-07T20:31:41.2178714Z scale_ub: Optional[float], 2025-05-07T20:31:41.2178805Z contiguous: bool, 2025-05-07T20:31:41.2178886Z compiled: bool, 2025-05-07T20:31:41.2178957Z ) -> None: 2025-05-07T20:31:41.2179050Z torch.manual_seed(2025) 2025-05-07T20:31:41.2179121Z 2025-05-07T20:31:41.2179367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2179448Z 2025-05-07T20:31:41.2179535Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2179659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2179744Z x = x_sign * x_clamp 2025-05-07T20:31:41.2179819Z x0 = x[:, :D] 2025-05-07T20:31:41.2179898Z x1 = x[:, D:] 2025-05-07T20:31:41.2179969Z 2025-05-07T20:31:41.2180049Z if contiguous: 2025-05-07T20:31:41.2180143Z x0 = x0.contiguous() 2025-05-07T20:31:41.2180229Z x1 = x1.contiguous() 2025-05-07T20:31:41.2180299Z 2025-05-07T20:31:41.2180395Z if scale_ub is not None: 2025-05-07T20:31:41.2180496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2180626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2180700Z ) 2025-05-07T20:31:41.2180774Z else: 2025-05-07T20:31:41.2180864Z scale_ub_tensor = None 2025-05-07T20:31:41.2180935Z 2025-05-07T20:31:41.2181071Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2181157Z op = silu_mul_quant 2025-05-07T20:31:41.2181240Z if compiled: 2025-05-07T20:31:41.2181336Z op = torch.compile(op) 2025-05-07T20:31:41.2181445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2181513Z 2025-05-07T20:31:41.2181601Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2181605Z 2025-05-07T20:31:41.2181701Z moe/activation_test.py:117: 2025-05-07T20:31:41.2181824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2181918Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2182016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2182382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2182474Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2182973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2183147Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2183509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2183731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2184079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2184171Z kernel = self.compile( 2025-05-07T20:31:41.2184553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2184728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2184854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2184858Z 2025-05-07T20:31:41.2185064Z self = 2025-05-07T20:31:41.2185845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2186341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090decea20>} 2025-05-07T20:31:41.2187098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2187284Z context = 2025-05-07T20:31:41.2187288Z 2025-05-07T20:31:41.2187454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2187792Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2187901Z module_map=module_map) 2025-05-07T20:31:41.2188062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2188156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2188232Z E ^ 2025-05-07T20:31:41.2188583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090d4525c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
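Note: the failure is architectural, not an input-dependent bug. Triton's fp8e4nv type is FP8 E4M3, which the NVIDIA backend only accepts on compute capability 8.9 and newer (Ada/Hopper); this job's g5.4xlarge runner carries an A10G (SM 8.6), where only the fp8e4b15 and fp8e5 encodings are available, hence the ValueError above. A minimal guard, sketched under the assumption of a unittest-style suite (gpu_supports_fp8e4nv is an illustrative helper, not an FBGEMM API), would skip rather than fail these tests on older GPUs:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) needs an NVIDIA GPU with
        # compute capability >= 8.9; the A10G on this runner is 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...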
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples failed at the same point, y_fp8, y_scale = fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] (the compiled=True cases additionally routed through torch/_dynamo/eval_frame.py:678 in _fn), raising the identical CompilationError shown above.
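Note: the failure reproduces outside FBGEMM. The kernel below is an illustrative sketch, not the _fbgemm_silu_mul_quant source; on a pre-SM-8.9 GPU, compiling any Triton kernel that casts to tl.float8e4nv should raise the same "type fp8e4nv not supported in this architecture" error at launch time:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast is what the compiler rejects.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on an A10G: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)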
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9eb60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
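Note: this example failed in the reference path rather than in fn(): triton_quantize_fp8_row is itself backed by a Triton kernel (_kernel_quantize_fp8_row), so on this GPU even the "reference" side of the comparison needs fp8e4nv support. A Triton-free stand-in for the row-wise quantization, sketched under the assumption that E4M3 with per-row inverse scales is the intended format (quantize_fp8_row_ref is illustrative, not the fbgemm_gpu API; the real kernel's clamping and epsilon details may differ), looks like:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its max |value| maps to the FP8 E4M3 max (448.0),
        # optionally clamping the row max to scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = x.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        xq = (x.to(torch.float32) * scale[:, None]).clamp(-fp8_max, fp8_max)
        # Return the inverse scale, so dequantization matches the test's
        # y_fp8.to(torch.float32) * y_scale[:, None].
        return xq.to(torch.float8_e4m3fn), 1.0 / scale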
2025-05-07T20:31:41.2273182Z op = torch.compile(op) 2025-05-07T20:31:41.2273368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2273442Z 2025-05-07T20:31:41.2273529Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2273533Z 2025-05-07T20:31:41.2273628Z moe/activation_test.py:117: 2025-05-07T20:31:41.2273751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2273848Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2273940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2274307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2274401Z return fn(*args, **kwargs) 2025-05-07T20:31:41.2274893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2274985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2275353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2275577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2275918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2276008Z kernel = self.compile( 2025-05-07T20:31:41.2276388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2276561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2276683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2276688Z 2025-05-07T20:31:41.2276888Z self = 2025-05-07T20:31:41.2277670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2278272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090dc9f880>} 2025-05-07T20:31:41.2279034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2279220Z context = 2025-05-07T20:31:41.2279225Z 2025-05-07T20:31:41.2279389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2279650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2279751Z module_map=module_map) 2025-05-07T20:31:41.2279911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2280020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2280092Z E ^ 2025-05-07T20:31:41.2280449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2280454Z 2025-05-07T20:31:41.2280867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2280871Z 2025-05-07T20:31:41.2280971Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2281192Z self=, 2025-05-07T20:31:41.2281265Z T=1, 2025-05-07T20:31:41.2281337Z D=5120, 2025-05-07T20:31:41.2281417Z scale_ub=1200.0, 2025-05-07T20:31:41.2281499Z contiguous=False, 2025-05-07T20:31:41.2281578Z compiled=False, 2025-05-07T20:31:41.2281648Z ) 2025-05-07T20:31:41.2281865Z self = 2025-05-07T20:31:41.2282106Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2282115Z 2025-05-07T20:31:41.2282190Z @given( 2025-05-07T20:31:41.2282308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2282401Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2282510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2282624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2287900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2287990Z ) 2025-05-07T20:31:41.2288244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2288338Z def test_silu_mul_quant( 2025-05-07T20:31:41.2288415Z self, 2025-05-07T20:31:41.2288487Z T: int, 2025-05-07T20:31:41.2288562Z D: int, 2025-05-07T20:31:41.2288660Z scale_ub: Optional[float], 2025-05-07T20:31:41.2288747Z contiguous: bool, 2025-05-07T20:31:41.2288835Z compiled: bool, 2025-05-07T20:31:41.2288920Z ) -> None: 2025-05-07T20:31:41.2289012Z torch.manual_seed(2025) 2025-05-07T20:31:41.2289082Z 2025-05-07T20:31:41.2289253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2289327Z 2025-05-07T20:31:41.2289414Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2289543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2289628Z x = x_sign * x_clamp 2025-05-07T20:31:41.2289710Z x0 = x[:, :D] 2025-05-07T20:31:41.2289785Z x1 = x[:, D:] 2025-05-07T20:31:41.2289856Z 2025-05-07T20:31:41.2289939Z if contiguous: 2025-05-07T20:31:41.2290028Z x0 = x0.contiguous() 2025-05-07T20:31:41.2290113Z x1 = x1.contiguous() 2025-05-07T20:31:41.2290187Z 2025-05-07T20:31:41.2290272Z if scale_ub is not None: 2025-05-07T20:31:41.2290374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2290511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2290693Z ) 2025-05-07T20:31:41.2290767Z else: 2025-05-07T20:31:41.2290862Z scale_ub_tensor = None 2025-05-07T20:31:41.2290933Z 2025-05-07T20:31:41.2291066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2291152Z op = silu_mul_quant 2025-05-07T20:31:41.2291232Z if compiled: 2025-05-07T20:31:41.2291334Z op = torch.compile(op) 2025-05-07T20:31:41.2291438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2291509Z 2025-05-07T20:31:41.2291597Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2291602Z 2025-05-07T20:31:41.2291699Z moe/activation_test.py:117: 2025-05-07T20:31:41.2291826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2291926Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2292025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2292538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2292640Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2293004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2293227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2293568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2293659Z kernel = self.compile( 2025-05-07T20:31:41.2294052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2294223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2294352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2294441Z 2025-05-07T20:31:41.2294648Z self = 2025-05-07T20:31:41.2295416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2295915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce76480>} 2025-05-07T20:31:41.2296662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2296852Z context = 2025-05-07T20:31:41.2296857Z 2025-05-07T20:31:41.2297023Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2297290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2297391Z module_map=module_map) 2025-05-07T20:31:41.2297548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2297648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2297720Z E ^ 2025-05-07T20:31:41.2298072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2298077Z 2025-05-07T20:31:41.2298494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2298498Z 2025-05-07T20:31:41.2298596Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2298818Z self=, 2025-05-07T20:31:41.2298892Z T=16384, 2025-05-07T20:31:41.2299054Z D=5120, 2025-05-07T20:31:41.2299135Z scale_ub=1200.0, 2025-05-07T20:31:41.2299215Z contiguous=False, 2025-05-07T20:31:41.2299291Z compiled=True, 2025-05-07T20:31:41.2299364Z ) 2025-05-07T20:31:41.2299577Z self = 2025-05-07T20:31:41.2299752Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2299759Z 2025-05-07T20:31:41.2299832Z @given( 2025-05-07T20:31:41.2299949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2300045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2300154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2300267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2300381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2300448Z ) 2025-05-07T20:31:41.2300694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2300791Z def test_silu_mul_quant( 2025-05-07T20:31:41.2300864Z self, 2025-05-07T20:31:41.2300934Z T: int, 2025-05-07T20:31:41.2301010Z D: int, 2025-05-07T20:31:41.2301103Z scale_ub: Optional[float], 2025-05-07T20:31:41.2301189Z contiguous: bool, 2025-05-07T20:31:41.2301268Z compiled: bool, 2025-05-07T20:31:41.2301340Z ) -> None: 2025-05-07T20:31:41.2301435Z torch.manual_seed(2025) 2025-05-07T20:31:41.2301503Z 2025-05-07T20:31:41.2301668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2301741Z 2025-05-07T20:31:41.2301827Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2301946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2302034Z x = x_sign * x_clamp 2025-05-07T20:31:41.2302111Z x0 = x[:, :D] 2025-05-07T20:31:41.2302188Z x1 = x[:, D:] 2025-05-07T20:31:41.2302265Z 2025-05-07T20:31:41.2302491Z if contiguous: 2025-05-07T20:31:41.2302591Z x0 = x0.contiguous() 2025-05-07T20:31:41.2302676Z x1 = x1.contiguous() 2025-05-07T20:31:41.2302745Z 2025-05-07T20:31:41.2302834Z if scale_ub is not None: 2025-05-07T20:31:41.2302935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2303067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2303141Z ) 2025-05-07T20:31:41.2303215Z else: 2025-05-07T20:31:41.2303303Z scale_ub_tensor = None 2025-05-07T20:31:41.2303373Z 2025-05-07T20:31:41.2303500Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2303583Z op = silu_mul_quant 2025-05-07T20:31:41.2303669Z if compiled: 2025-05-07T20:31:41.2303763Z op = torch.compile(op) 2025-05-07T20:31:41.2303866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2303938Z 2025-05-07T20:31:41.2304022Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2304038Z 2025-05-07T20:31:41.2304133Z moe/activation_test.py:117: 2025-05-07T20:31:41.2304256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2304352Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2304451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2304818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2304905Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2305401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2305494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2305857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2306078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2306502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2306592Z kernel = self.compile( 2025-05-07T20:31:41.2306973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2307151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2307273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2307277Z 2025-05-07T20:31:41.2307483Z self = 2025-05-07T20:31:41.2308254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2308754Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f090ce751c0>} 2025-05-07T20:31:41.2309515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2309699Z context = 2025-05-07T20:31:41.2309704Z 2025-05-07T20:31:41.2309873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2310134Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2310238Z module_map=module_map) 2025-05-07T20:31:41.2310398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2310490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2310563Z E ^ 2025-05-07T20:31:41.2311000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2311010Z 2025-05-07T20:31:41.2311424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2311429Z 2025-05-07T20:31:41.2311529Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2311747Z self=, 2025-05-07T20:31:41.2311820Z T=2048, 2025-05-07T20:31:41.2311896Z D=7168, 2025-05-07T20:31:41.2311971Z scale_ub=1200.0, 2025-05-07T20:31:41.2312056Z contiguous=False, 2025-05-07T20:31:41.2312132Z compiled=True, 2025-05-07T20:31:41.2312200Z ) 2025-05-07T20:31:41.2312417Z self = 2025-05-07T20:31:41.2312590Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2312594Z 2025-05-07T20:31:41.2312672Z @given( 2025-05-07T20:31:41.2312791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2312885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2312994Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2313109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2313218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2313292Z ) 2025-05-07T20:31:41.2313537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2313623Z def test_silu_mul_quant( 2025-05-07T20:31:41.2313697Z self, 2025-05-07T20:31:41.2313767Z T: int, 2025-05-07T20:31:41.2313836Z D: int, 2025-05-07T20:31:41.2313930Z scale_ub: Optional[float], 2025-05-07T20:31:41.2314014Z contiguous: bool, 2025-05-07T20:31:41.2314093Z compiled: bool, 2025-05-07T20:31:41.2314167Z ) -> None: 2025-05-07T20:31:41.2314260Z torch.manual_seed(2025) 2025-05-07T20:31:41.2314436Z 2025-05-07T20:31:41.2314604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2314675Z 2025-05-07T20:31:41.2314766Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2314885Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2314967Z x = x_sign * x_clamp 2025-05-07T20:31:41.2315043Z x0 = x[:, :D] 2025-05-07T20:31:41.2315117Z x1 = x[:, D:] 2025-05-07T20:31:41.2315184Z 2025-05-07T20:31:41.2315265Z if contiguous: 2025-05-07T20:31:41.2315351Z x0 = x0.contiguous() 2025-05-07T20:31:41.2315434Z x1 = x1.contiguous() 2025-05-07T20:31:41.2315505Z 2025-05-07T20:31:41.2315591Z if scale_ub is not None: 2025-05-07T20:31:41.2315694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2315826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2315894Z ) 2025-05-07T20:31:41.2315971Z else: 2025-05-07T20:31:41.2316065Z scale_ub_tensor = None 2025-05-07T20:31:41.2316135Z 2025-05-07T20:31:41.2316263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2316348Z op = silu_mul_quant 2025-05-07T20:31:41.2316429Z if compiled: 2025-05-07T20:31:41.2316525Z op = torch.compile(op) 2025-05-07T20:31:41.2316624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2316694Z 2025-05-07T20:31:41.2316783Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2316787Z 2025-05-07T20:31:41.2316879Z moe/activation_test.py:117: 2025-05-07T20:31:41.2317002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2317098Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2317192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2317561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2317734Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f090ce76fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
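The failure is environmental rather than numerical: fp8e4nv is Triton's name for the e4m3 float8 format, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job's g5.4xlarge runner carries an A10G, which reports capability (8, 6), so compilation of _fbgemm_silu_mul_quant aborts before any kernel launches. A guard of roughly the following shape would let the suite skip cleanly on such runners; the helper and class names below are illustrative, not taken from the FBGEMM test suite:

```python
# Illustrative guard only -- the helper and class names are hypothetical,
# not FBGEMM's actual code.
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    """fp8e4nv (float8 e4m3) needs an SM 8.9+ GPU (Ada/Hopper);
    the A10G on a g5 runner reports (8, 6) and hits this error."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):
    ...
```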
Hypothesis then tried eleven more examples, and every one failed at the same point: the Triton frontend rejects the fp8e4nv conversion while compiling _fbgemm_silu_mul_quant, before the kernel ever runs (when compiled=True the traceback only gains one extra torch/_dynamo/eval_frame.py frame). The examples tried were:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)

Each attempt ended with the identical error:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2480748Z 2025-05-07T20:31:41.2481239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2481249Z 2025-05-07T20:31:41.2481351Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2481570Z self=, 2025-05-07T20:31:41.2481644Z T=16384, 2025-05-07T20:31:41.2481720Z D=5120, 2025-05-07T20:31:41.2481802Z scale_ub=1200.0, 2025-05-07T20:31:41.2481884Z contiguous=False, 2025-05-07T20:31:41.2481966Z compiled=False, 2025-05-07T20:31:41.2482036Z ) 2025-05-07T20:31:41.2482254Z self = 2025-05-07T20:31:41.2482432Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2482436Z 2025-05-07T20:31:41.2482508Z @given( 2025-05-07T20:31:41.2482625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2482721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2482836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2482955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2483063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2483135Z ) 2025-05-07T20:31:41.2483484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2483574Z def test_silu_mul_quant( 2025-05-07T20:31:41.2483650Z self, 2025-05-07T20:31:41.2483722Z T: int, 2025-05-07T20:31:41.2483794Z D: int, 2025-05-07T20:31:41.2483893Z scale_ub: Optional[float], 2025-05-07T20:31:41.2483977Z contiguous: bool, 2025-05-07T20:31:41.2484057Z compiled: bool, 2025-05-07T20:31:41.2484135Z ) -> None: 2025-05-07T20:31:41.2484225Z torch.manual_seed(2025) 2025-05-07T20:31:41.2484294Z 2025-05-07T20:31:41.2484462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2484533Z 2025-05-07T20:31:41.2484620Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2484834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2484917Z x = x_sign * x_clamp 2025-05-07T20:31:41.2484993Z x0 = x[:, :D] 2025-05-07T20:31:41.2485068Z x1 = x[:, D:] 2025-05-07T20:31:41.2485136Z 2025-05-07T20:31:41.2485217Z if contiguous: 2025-05-07T20:31:41.2485304Z x0 = x0.contiguous() 2025-05-07T20:31:41.2485389Z x1 = x1.contiguous() 2025-05-07T20:31:41.2485463Z 2025-05-07T20:31:41.2485550Z if scale_ub is not None: 2025-05-07T20:31:41.2485652Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2485786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2485857Z ) 2025-05-07T20:31:41.2485929Z else: 2025-05-07T20:31:41.2486020Z scale_ub_tensor = None 2025-05-07T20:31:41.2486089Z 2025-05-07T20:31:41.2486219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2486309Z op = silu_mul_quant 2025-05-07T20:31:41.2486393Z if compiled: 2025-05-07T20:31:41.2486494Z op = torch.compile(op) 2025-05-07T20:31:41.2486594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2486665Z 2025-05-07T20:31:41.2486757Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2486761Z 2025-05-07T20:31:41.2486853Z moe/activation_test.py:117: 2025-05-07T20:31:41.2486976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2487076Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2487172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2487678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:41.2487769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2488129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2488440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2488789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2488880Z kernel = self.compile( 2025-05-07T20:31:41.2489267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2489438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2489565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2489569Z 2025-05-07T20:31:41.2489772Z self = 2025-05-07T20:31:41.2490548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2491055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa29620>} 2025-05-07T20:31:41.2491808Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2491998Z context = 2025-05-07T20:31:41.2492002Z 2025-05-07T20:31:41.2492163Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2492426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2492530Z module_map=module_map) 2025-05-07T20:31:41.2492687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2492787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2492938Z E ^ 2025-05-07T20:31:41.2493290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2493295Z 2025-05-07T20:31:41.2493711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2493715Z 2025-05-07T20:31:41.2493815Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2494037Z self=, 2025-05-07T20:31:41.2494111Z T=16384, 2025-05-07T20:31:41.2494184Z D=5120, 2025-05-07T20:31:41.2494266Z scale_ub=1200.0, 2025-05-07T20:31:41.2494347Z contiguous=True, 2025-05-07T20:31:41.2494426Z compiled=True, 2025-05-07T20:31:41.2494496Z ) 2025-05-07T20:31:41.2494710Z self = 2025-05-07T20:31:41.2494883Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2494895Z 2025-05-07T20:31:41.2494969Z @given( 2025-05-07T20:31:41.2495084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2495182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2495293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2495407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2495520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2495589Z ) 2025-05-07T20:31:41.2495832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2495922Z def test_silu_mul_quant( 2025-05-07T20:31:41.2495996Z self, 2025-05-07T20:31:41.2496069Z T: int, 2025-05-07T20:31:41.2496147Z D: int, 2025-05-07T20:31:41.2496239Z scale_ub: Optional[float], 2025-05-07T20:31:41.2496326Z contiguous: bool, 2025-05-07T20:31:41.2496510Z compiled: bool, 2025-05-07T20:31:41.2496587Z ) -> None: 2025-05-07T20:31:41.2496681Z torch.manual_seed(2025) 2025-05-07T20:31:41.2496751Z 2025-05-07T20:31:41.2496918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2496993Z 2025-05-07T20:31:41.2497082Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2497203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2497288Z x = x_sign * x_clamp 2025-05-07T20:31:41.2497364Z x0 = x[:, :D] 2025-05-07T20:31:41.2497438Z x1 = x[:, D:] 2025-05-07T20:31:41.2497511Z 2025-05-07T20:31:41.2497590Z if contiguous: 2025-05-07T20:31:41.2497676Z x0 = x0.contiguous() 2025-05-07T20:31:41.2497763Z x1 = x1.contiguous() 2025-05-07T20:31:41.2497830Z 2025-05-07T20:31:41.2497917Z if scale_ub is not None: 2025-05-07T20:31:41.2498018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2498155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2498233Z ) 2025-05-07T20:31:41.2498306Z else: 2025-05-07T20:31:41.2498395Z scale_ub_tensor = None 2025-05-07T20:31:41.2498467Z 2025-05-07T20:31:41.2498593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2498676Z op = silu_mul_quant 2025-05-07T20:31:41.2498764Z if compiled: 2025-05-07T20:31:41.2498860Z op = torch.compile(op) 2025-05-07T20:31:41.2498962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2499035Z 2025-05-07T20:31:41.2499122Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2499126Z 2025-05-07T20:31:41.2499224Z moe/activation_test.py:117: 2025-05-07T20:31:41.2499346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2499442Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2499540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2499912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2500083Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2500581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2500676Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2501038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2501258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2501597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2501687Z kernel = self.compile( 2025-05-07T20:31:41.2502070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2502246Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2502377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2502381Z 2025-05-07T20:31:41.2502583Z self = 2025-05-07T20:31:41.2503356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2503852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2aa20>} 2025-05-07T20:31:41.2504604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2504872Z context = 2025-05-07T20:31:41.2504877Z 2025-05-07T20:31:41.2505039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2505302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2505404Z module_map=module_map) 2025-05-07T20:31:41.2505565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2505658Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2505730Z E ^ 2025-05-07T20:31:41.2506085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2506090Z 2025-05-07T20:31:41.2506501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2506505Z 2025-05-07T20:31:41.2506611Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2506835Z self=, 2025-05-07T20:31:41.2506908Z T=16384, 2025-05-07T20:31:41.2506985Z D=5120, 2025-05-07T20:31:41.2507060Z scale_ub=None, 2025-05-07T20:31:41.2507140Z contiguous=False, 2025-05-07T20:31:41.2507230Z compiled=True, 2025-05-07T20:31:41.2507314Z ) 2025-05-07T20:31:41.2507552Z self = 2025-05-07T20:31:41.2507726Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2507731Z 2025-05-07T20:31:41.2507804Z @given( 2025-05-07T20:31:41.2507919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2508016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2508130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2508246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2508363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2508511Z ) 2025-05-07T20:31:41.2508758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2508847Z def test_silu_mul_quant( 2025-05-07T20:31:41.2508920Z self, 2025-05-07T20:31:41.2508998Z T: int, 2025-05-07T20:31:41.2509071Z D: int, 2025-05-07T20:31:41.2509163Z scale_ub: Optional[float], 2025-05-07T20:31:41.2509249Z contiguous: bool, 2025-05-07T20:31:41.2509329Z compiled: bool, 2025-05-07T20:31:41.2509403Z ) -> None: 2025-05-07T20:31:41.2509494Z torch.manual_seed(2025) 2025-05-07T20:31:41.2509565Z 2025-05-07T20:31:41.2509736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2509808Z 2025-05-07T20:31:41.2509896Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2510020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2510109Z x = x_sign * x_clamp 2025-05-07T20:31:41.2510196Z x0 = x[:, :D] 2025-05-07T20:31:41.2510271Z x1 = x[:, D:] 2025-05-07T20:31:41.2510339Z 2025-05-07T20:31:41.2510416Z if contiguous: 2025-05-07T20:31:41.2510505Z x0 = x0.contiguous() 2025-05-07T20:31:41.2510590Z x1 = x1.contiguous() 2025-05-07T20:31:41.2510659Z 2025-05-07T20:31:41.2510746Z if scale_ub is not None: 2025-05-07T20:31:41.2510847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2510980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2511054Z ) 2025-05-07T20:31:41.2511127Z else: 2025-05-07T20:31:41.2511219Z scale_ub_tensor = None 2025-05-07T20:31:41.2511285Z 2025-05-07T20:31:41.2511410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2511497Z op = silu_mul_quant 2025-05-07T20:31:41.2511576Z if compiled: 2025-05-07T20:31:41.2511751Z op = torch.compile(op) 2025-05-07T20:31:41.2511870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2511940Z 2025-05-07T20:31:41.2512027Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2512034Z 2025-05-07T20:31:41.2512129Z moe/activation_test.py:117: 2025-05-07T20:31:41.2512253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2512351Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2512446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2512813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2512903Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2513397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2513490Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2513856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2514081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2514422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2514511Z kernel = self.compile( 2025-05-07T20:31:41.2514894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2515068Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2515190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2515195Z 2025-05-07T20:31:41.2515400Z self = 2025-05-07T20:31:41.2516175Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2516756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffa2bc40>} 2025-05-07T20:31:41.2517555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2517744Z context = 2025-05-07T20:31:41.2517748Z 2025-05-07T20:31:41.2517913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2518173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2518276Z module_map=module_map) 2025-05-07T20:31:41.2518443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2518541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2518617Z E ^ 2025-05-07T20:31:41.2518971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2518975Z 2025-05-07T20:31:41.2519386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2519391Z 2025-05-07T20:31:41.2519492Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2519711Z self=, 2025-05-07T20:31:41.2519790Z T=2048, 2025-05-07T20:31:41.2519862Z D=5120, 2025-05-07T20:31:41.2519937Z scale_ub=None, 2025-05-07T20:31:41.2520023Z contiguous=False, 2025-05-07T20:31:41.2520102Z compiled=True, 2025-05-07T20:31:41.2520171Z ) 2025-05-07T20:31:41.2520465Z self = 2025-05-07T20:31:41.2520640Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2520644Z 2025-05-07T20:31:41.2520715Z @given( 2025-05-07T20:31:41.2520833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2520929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2521041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2521153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2521260Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2521332Z ) 2025-05-07T20:31:41.2521572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2521662Z def test_silu_mul_quant( 2025-05-07T20:31:41.2521734Z self, 2025-05-07T20:31:41.2521807Z T: int, 2025-05-07T20:31:41.2521877Z D: int, 2025-05-07T20:31:41.2521973Z scale_ub: Optional[float], 2025-05-07T20:31:41.2522063Z contiguous: bool, 2025-05-07T20:31:41.2522147Z compiled: bool, 2025-05-07T20:31:41.2522224Z ) -> None: 2025-05-07T20:31:41.2522316Z torch.manual_seed(2025) 2025-05-07T20:31:41.2522385Z 2025-05-07T20:31:41.2522550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2522621Z 2025-05-07T20:31:41.2522710Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2522830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2522914Z x = x_sign * x_clamp 2025-05-07T20:31:41.2522992Z x0 = x[:, :D] 2025-05-07T20:31:41.2523069Z x1 = x[:, D:] 2025-05-07T20:31:41.2523137Z 2025-05-07T20:31:41.2523219Z if contiguous: 2025-05-07T20:31:41.2523387Z x0 = x0.contiguous() 2025-05-07T20:31:41.2523473Z x1 = x1.contiguous() 2025-05-07T20:31:41.2523545Z 2025-05-07T20:31:41.2523632Z if scale_ub is not None: 2025-05-07T20:31:41.2523740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2523980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2524054Z ) 2025-05-07T20:31:41.2524128Z else: 2025-05-07T20:31:41.2529219Z scale_ub_tensor = None 2025-05-07T20:31:41.2529298Z 2025-05-07T20:31:41.2529443Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2529534Z op = silu_mul_quant 2025-05-07T20:31:41.2529616Z if compiled: 2025-05-07T20:31:41.2529717Z op = torch.compile(op) 2025-05-07T20:31:41.2529821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2529891Z 2025-05-07T20:31:41.2529983Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2529988Z 2025-05-07T20:31:41.2530081Z moe/activation_test.py:117: 2025-05-07T20:31:41.2530214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2530310Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2530413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2530790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2530880Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2531371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2531468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2531823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2532046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2532384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2532476Z kernel = self.compile( 2025-05-07T20:31:41.2532959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2533138Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2533262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2533273Z 2025-05-07T20:31:41.2533476Z self = 2025-05-07T20:31:41.2534243Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2534743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb87c0>} 2025-05-07T20:31:41.2535496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2535690Z context = 2025-05-07T20:31:41.2535695Z 2025-05-07T20:31:41.2535857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2536118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2536229Z module_map=module_map) 2025-05-07T20:31:41.2536388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2536486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2536557Z E ^ 2025-05-07T20:31:41.2536911Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2536916Z 2025-05-07T20:31:41.2537332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2537422Z 2025-05-07T20:31:41.2537525Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2537743Z self=, 2025-05-07T20:31:41.2537824Z T=2048, 2025-05-07T20:31:41.2537897Z D=5120, 2025-05-07T20:31:41.2537980Z scale_ub=1200.0, 2025-05-07T20:31:41.2538063Z contiguous=False, 2025-05-07T20:31:41.2538141Z compiled=True, 2025-05-07T20:31:41.2538216Z ) 2025-05-07T20:31:41.2538645Z self = 2025-05-07T20:31:41.2538893Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2538901Z 2025-05-07T20:31:41.2538979Z @given( 2025-05-07T20:31:41.2539096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2539193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2539307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2539426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2539546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2539618Z ) 2025-05-07T20:31:41.2539859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2539951Z def test_silu_mul_quant( 2025-05-07T20:31:41.2540023Z self, 2025-05-07T20:31:41.2540095Z T: int, 2025-05-07T20:31:41.2540169Z D: int, 2025-05-07T20:31:41.2540263Z scale_ub: Optional[float], 2025-05-07T20:31:41.2540348Z contiguous: bool, 2025-05-07T20:31:41.2540431Z compiled: bool, 2025-05-07T20:31:41.2540506Z ) -> None: 2025-05-07T20:31:41.2540598Z torch.manual_seed(2025) 2025-05-07T20:31:41.2540670Z 2025-05-07T20:31:41.2540836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2540909Z 2025-05-07T20:31:41.2540996Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2541298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2541395Z x = x_sign * x_clamp 2025-05-07T20:31:41.2541472Z x0 = x[:, :D] 2025-05-07T20:31:41.2541548Z x1 = x[:, D:] 2025-05-07T20:31:41.2541622Z 2025-05-07T20:31:41.2541702Z if contiguous: 2025-05-07T20:31:41.2541791Z x0 = x0.contiguous() 2025-05-07T20:31:41.2541876Z x1 = x1.contiguous() 2025-05-07T20:31:41.2541947Z 2025-05-07T20:31:41.2542033Z if scale_ub is not None: 2025-05-07T20:31:41.2542142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2542275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2542356Z ) 2025-05-07T20:31:41.2542428Z else: 2025-05-07T20:31:41.2542518Z scale_ub_tensor = None 2025-05-07T20:31:41.2542589Z 2025-05-07T20:31:41.2542717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2542802Z op = silu_mul_quant 2025-05-07T20:31:41.2542895Z if compiled: 2025-05-07T20:31:41.2542996Z op = torch.compile(op) 2025-05-07T20:31:41.2543097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2543168Z 2025-05-07T20:31:41.2543256Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2543261Z 2025-05-07T20:31:41.2543354Z moe/activation_test.py:117: 2025-05-07T20:31:41.2543481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2543579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2543678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2544045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2544135Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2544631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2544724Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2545210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2545438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2545775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2545870Z kernel = self.compile( 2025-05-07T20:31:41.2546252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2546422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2546549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2546553Z 2025-05-07T20:31:41.2546754Z self = 2025-05-07T20:31:41.2547530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2548031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdb98a0>} 2025-05-07T20:31:41.2548778Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2548966Z context = 2025-05-07T20:31:41.2548971Z 2025-05-07T20:31:41.2549131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2549398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2549581Z module_map=module_map) 2025-05-07T20:31:41.2549748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2549843Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2549915Z E ^ 2025-05-07T20:31:41.2550270Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2550275Z 2025-05-07T20:31:41.2550685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2550690Z 2025-05-07T20:31:41.2550786Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2551007Z self=, 2025-05-07T20:31:41.2551080Z T=4096, 2025-05-07T20:31:41.2551151Z D=5120, 2025-05-07T20:31:41.2551232Z scale_ub=1200.0, 2025-05-07T20:31:41.2551308Z contiguous=True, 2025-05-07T20:31:41.2551394Z compiled=True, 2025-05-07T20:31:41.2551463Z ) 2025-05-07T20:31:41.2551687Z self = 2025-05-07T20:31:41.2551859Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2551864Z 2025-05-07T20:31:41.2551936Z @given( 2025-05-07T20:31:41.2552049Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2552147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2552259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2552370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2552482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2552555Z ) 2025-05-07T20:31:41.2552802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2552892Z def test_silu_mul_quant( 2025-05-07T20:31:41.2552964Z self, 2025-05-07T20:31:41.2553039Z T: int, 2025-05-07T20:31:41.2553111Z D: int, 2025-05-07T20:31:41.2553209Z scale_ub: Optional[float], 2025-05-07T20:31:41.2553378Z contiguous: bool, 2025-05-07T20:31:41.2553457Z compiled: bool, 2025-05-07T20:31:41.2553533Z ) -> None: 2025-05-07T20:31:41.2553625Z torch.manual_seed(2025) 2025-05-07T20:31:41.2553692Z 2025-05-07T20:31:41.2553862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2553931Z 2025-05-07T20:31:41.2554017Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2554140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2554224Z x = x_sign * x_clamp 2025-05-07T20:31:41.2554298Z x0 = x[:, :D] 2025-05-07T20:31:41.2554374Z x1 = x[:, D:] 2025-05-07T20:31:41.2554440Z 2025-05-07T20:31:41.2554517Z if contiguous: 2025-05-07T20:31:41.2554606Z x0 = x0.contiguous() 2025-05-07T20:31:41.2554691Z x1 = x1.contiguous() 2025-05-07T20:31:41.2554762Z 2025-05-07T20:31:41.2554851Z if scale_ub is not None: 2025-05-07T20:31:41.2554959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2555092Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2555165Z ) 2025-05-07T20:31:41.2555237Z else: 2025-05-07T20:31:41.2555332Z scale_ub_tensor = None 2025-05-07T20:31:41.2555400Z 2025-05-07T20:31:41.2555528Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2555615Z op = silu_mul_quant 2025-05-07T20:31:41.2555693Z if compiled: 2025-05-07T20:31:41.2555785Z op = torch.compile(op) 2025-05-07T20:31:41.2555886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2555952Z 2025-05-07T20:31:41.2556039Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2556043Z 2025-05-07T20:31:41.2556135Z moe/activation_test.py:117: 2025-05-07T20:31:41.2556258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2556434Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2556536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2556903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2556994Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2557487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2557584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2557939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2558159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2558498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2558588Z kernel = self.compile( 2025-05-07T20:31:41.2558976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2559157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2559280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2559285Z 2025-05-07T20:31:41.2559490Z self = 2025-05-07T20:31:41.2560256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2560751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ffdbaac0>} 2025-05-07T20:31:41.2561508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2561882Z context = 2025-05-07T20:31:41.2561887Z 2025-05-07T20:31:41.2562051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2562309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2562413Z module_map=module_map) 2025-05-07T20:31:41.2562569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2562662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2562741Z E ^ 2025-05-07T20:31:41.2563091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2563096Z 2025-05-07T20:31:41.2563643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2563654Z 2025-05-07T20:31:41.2563754Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2563971Z self=, 2025-05-07T20:31:41.2564042Z T=128, 2025-05-07T20:31:41.2564113Z D=5120, 2025-05-07T20:31:41.2564191Z scale_ub=1200.0, 2025-05-07T20:31:41.2564276Z contiguous=False, 2025-05-07T20:31:41.2564351Z compiled=True, 2025-05-07T20:31:41.2564421Z ) 2025-05-07T20:31:41.2564640Z self = 2025-05-07T20:31:41.2564807Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.2564812Z 2025-05-07T20:31:41.2564884Z @given( 2025-05-07T20:31:41.2565001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2565096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2565289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2565405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2565513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2565587Z ) 2025-05-07T20:31:41.2565827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2565913Z def test_silu_mul_quant( 2025-05-07T20:31:41.2565990Z self, 2025-05-07T20:31:41.2566061Z T: int, 2025-05-07T20:31:41.2566131Z D: int, 2025-05-07T20:31:41.2566226Z scale_ub: Optional[float], 2025-05-07T20:31:41.2566311Z contiguous: bool, 2025-05-07T20:31:41.2566391Z compiled: bool, 2025-05-07T20:31:41.2566469Z ) -> None: 2025-05-07T20:31:41.2566558Z torch.manual_seed(2025) 2025-05-07T20:31:41.2566629Z 2025-05-07T20:31:41.2566794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2566863Z 2025-05-07T20:31:41.2566959Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2567084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2567166Z x = x_sign * x_clamp 2025-05-07T20:31:41.2567241Z x0 = x[:, :D] 2025-05-07T20:31:41.2567315Z x1 = x[:, D:] 2025-05-07T20:31:41.2567381Z 2025-05-07T20:31:41.2567463Z if contiguous: 2025-05-07T20:31:41.2567547Z x0 = x0.contiguous() 2025-05-07T20:31:41.2567630Z x1 = x1.contiguous() 2025-05-07T20:31:41.2567704Z 2025-05-07T20:31:41.2567790Z if scale_ub is not None: 2025-05-07T20:31:41.2567895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2568024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2568095Z ) 2025-05-07T20:31:41.2568171Z else: 2025-05-07T20:31:41.2568259Z scale_ub_tensor = None 2025-05-07T20:31:41.2568328Z 2025-05-07T20:31:41.2568454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2568545Z op = silu_mul_quant 2025-05-07T20:31:41.2568708Z if compiled: 2025-05-07T20:31:41.2568805Z op = torch.compile(op) 2025-05-07T20:31:41.2568905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2568970Z 2025-05-07T20:31:41.2569057Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2569061Z 2025-05-07T20:31:41.2569152Z moe/activation_test.py:117: 2025-05-07T20:31:41.2569280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2569376Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2569471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2569843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2569931Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2570429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2570534Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2570891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2571114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2571451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2571539Z kernel = self.compile( 2025-05-07T20:31:41.2571923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2572093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2572214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2572223Z 2025-05-07T20:31:41.2572427Z self = 2025-05-07T20:31:41.2573276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2573778Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80c540>} 2025-05-07T20:31:41.2574524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2574710Z context = 2025-05-07T20:31:41.2574715Z 2025-05-07T20:31:41.2574874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2575138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2575246Z module_map=module_map) 2025-05-07T20:31:41.2575402Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2575498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2575570Z E ^ 2025-05-07T20:31:41.2575919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2575923Z 2025-05-07T20:31:41.2576337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2576341Z 2025-05-07T20:31:41.2576439Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2576659Z self=, 2025-05-07T20:31:41.2576734Z T=16384, 2025-05-07T20:31:41.2576805Z D=7168, 2025-05-07T20:31:41.2576884Z scale_ub=1200.0, 2025-05-07T20:31:41.2576963Z contiguous=True, 2025-05-07T20:31:41.2577045Z compiled=True, 2025-05-07T20:31:41.2577197Z ) 2025-05-07T20:31:41.2577410Z self = 2025-05-07T20:31:41.2577581Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2577586Z 2025-05-07T20:31:41.2577666Z @given( 2025-05-07T20:31:41.2577778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2577871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2577984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2578095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2578207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2578274Z ) 2025-05-07T20:31:41.2578515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2578606Z def test_silu_mul_quant( 2025-05-07T20:31:41.2578678Z self, 2025-05-07T20:31:41.2578748Z T: int, 2025-05-07T20:31:41.2578833Z D: int, 2025-05-07T20:31:41.2578925Z scale_ub: Optional[float], 2025-05-07T20:31:41.2579008Z contiguous: bool, 2025-05-07T20:31:41.2579092Z compiled: bool, 2025-05-07T20:31:41.2579162Z ) -> None: 2025-05-07T20:31:41.2579249Z torch.manual_seed(2025) 2025-05-07T20:31:41.2579321Z 2025-05-07T20:31:41.2579484Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2579557Z 2025-05-07T20:31:41.2579642Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2579760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2579846Z x = x_sign * x_clamp 2025-05-07T20:31:41.2579920Z x0 = x[:, :D] 2025-05-07T20:31:41.2579994Z x1 = x[:, D:] 2025-05-07T20:31:41.2580067Z 2025-05-07T20:31:41.2580143Z if contiguous: 2025-05-07T20:31:41.2580228Z x0 = x0.contiguous() 2025-05-07T20:31:41.2580312Z x1 = x1.contiguous() 2025-05-07T20:31:41.2581050Z 2025-05-07T20:31:41.2581150Z if scale_ub is not None: 2025-05-07T20:31:41.2581254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2581385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2581456Z ) 2025-05-07T20:31:41.2581528Z else: 2025-05-07T20:31:41.2581616Z scale_ub_tensor = None 2025-05-07T20:31:41.2581687Z 2025-05-07T20:31:41.2581813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2581895Z op = silu_mul_quant 2025-05-07T20:31:41.2581977Z if compiled: 2025-05-07T20:31:41.2582071Z op = torch.compile(op) 2025-05-07T20:31:41.2582172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2582242Z 2025-05-07T20:31:41.2582328Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2582333Z 2025-05-07T20:31:41.2582427Z moe/activation_test.py:117: 2025-05-07T20:31:41.2582559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2582658Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2582755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2583123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2583209Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2583704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2583795Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2584149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2584373Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2584711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2584812Z kernel = self.compile( 2025-05-07T20:31:41.2585273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2585443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2585569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2585573Z 2025-05-07T20:31:41.2585777Z self = 2025-05-07T20:31:41.2586545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2587044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80d080>} 2025-05-07T20:31:41.2587794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2587993Z context = 2025-05-07T20:31:41.2587998Z 2025-05-07T20:31:41.2588159Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2588420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2588521Z module_map=module_map) 2025-05-07T20:31:41.2588679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2588776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2588845Z E ^ 2025-05-07T20:31:41.2589195Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2589200Z 2025-05-07T20:31:41.2589717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2589722Z 2025-05-07T20:31:41.2589822Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2590042Z self=, 2025-05-07T20:31:41.2590115Z T=16384, 2025-05-07T20:31:41.2590185Z D=5120, 2025-05-07T20:31:41.2590274Z scale_ub=1200.0, 2025-05-07T20:31:41.2590355Z contiguous=True, 2025-05-07T20:31:41.2590431Z compiled=False, 2025-05-07T20:31:41.2590503Z ) 2025-05-07T20:31:41.2590716Z self = 2025-05-07T20:31:41.2590889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2590893Z 2025-05-07T20:31:41.2590966Z @given( 2025-05-07T20:31:41.2591080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2591181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2591294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2591404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2591516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2591584Z ) 2025-05-07T20:31:41.2591824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2591915Z def test_silu_mul_quant( 2025-05-07T20:31:41.2591987Z self, 2025-05-07T20:31:41.2592055Z T: int, 2025-05-07T20:31:41.2592129Z D: int, 2025-05-07T20:31:41.2592220Z scale_ub: Optional[float], 2025-05-07T20:31:41.2592306Z contiguous: bool, 2025-05-07T20:31:41.2592385Z compiled: bool, 2025-05-07T20:31:41.2592460Z ) -> None: 2025-05-07T20:31:41.2592552Z torch.manual_seed(2025) 2025-05-07T20:31:41.2592620Z 2025-05-07T20:31:41.2592787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2592946Z 2025-05-07T20:31:41.2593033Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2593152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2593238Z x = x_sign * x_clamp 2025-05-07T20:31:41.2593312Z x0 = x[:, :D] 2025-05-07T20:31:41.2593386Z x1 = x[:, D:] 2025-05-07T20:31:41.2593456Z 2025-05-07T20:31:41.2593533Z if contiguous: 2025-05-07T20:31:41.2593617Z x0 = x0.contiguous() 2025-05-07T20:31:41.2593708Z x1 = x1.contiguous() 2025-05-07T20:31:41.2593776Z 2025-05-07T20:31:41.2593864Z if scale_ub is not None: 2025-05-07T20:31:41.2593964Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2594094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2594168Z ) 2025-05-07T20:31:41.2594237Z else: 2025-05-07T20:31:41.2594324Z scale_ub_tensor = None 2025-05-07T20:31:41.2594396Z 2025-05-07T20:31:41.2594526Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2594614Z op = silu_mul_quant 2025-05-07T20:31:41.2594697Z if compiled: 2025-05-07T20:31:41.2594790Z op = torch.compile(op) 2025-05-07T20:31:41.2594889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2594960Z 2025-05-07T20:31:41.2595044Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2595048Z 2025-05-07T20:31:41.2595142Z moe/activation_test.py:117: 2025-05-07T20:31:41.2595264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2595359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2595453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2595953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:41.2596046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2596488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2596716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2597056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2597145Z kernel = self.compile( 2025-05-07T20:31:41.2597576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2597749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2597871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2597876Z 2025-05-07T20:31:41.2598079Z self = 2025-05-07T20:31:41.2598851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2599348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff80e660>} 2025-05-07T20:31:41.2600099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2600285Z context = 2025-05-07T20:31:41.2600289Z 2025-05-07T20:31:41.2600452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2600709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2600811Z module_map=module_map) 2025-05-07T20:31:41.2600974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2601145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2601221Z E ^ 2025-05-07T20:31:41.2601574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2601578Z 2025-05-07T20:31:41.2601989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2601993Z 2025-05-07T20:31:41.2602094Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2602313Z self=, 2025-05-07T20:31:41.2602385Z T=1, 2025-05-07T20:31:41.2602459Z D=7168, 2025-05-07T20:31:41.2602536Z scale_ub=1200.0, 2025-05-07T20:31:41.2602622Z contiguous=False, 2025-05-07T20:31:41.2602702Z compiled=False, 2025-05-07T20:31:41.2602769Z ) 2025-05-07T20:31:41.2602989Z self = 2025-05-07T20:31:41.2603158Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2603163Z 2025-05-07T20:31:41.2603235Z @given( 2025-05-07T20:31:41.2603481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2603575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2603682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2603795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2603903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2603976Z ) 2025-05-07T20:31:41.2604214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2604301Z def test_silu_mul_quant( 2025-05-07T20:31:41.2604375Z self, 2025-05-07T20:31:41.2604447Z T: int, 2025-05-07T20:31:41.2604515Z D: int, 2025-05-07T20:31:41.2604610Z scale_ub: Optional[float], 2025-05-07T20:31:41.2604775Z contiguous: bool, 2025-05-07T20:31:41.2604862Z compiled: bool, 2025-05-07T20:31:41.2604935Z ) -> None: 2025-05-07T20:31:41.2605024Z torch.manual_seed(2025) 2025-05-07T20:31:41.2605094Z 2025-05-07T20:31:41.2605262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2605332Z 2025-05-07T20:31:41.2605421Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2605540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2605621Z x = x_sign * x_clamp 2025-05-07T20:31:41.2605697Z x0 = x[:, :D] 2025-05-07T20:31:41.2605770Z x1 = x[:, D:] 2025-05-07T20:31:41.2605838Z 2025-05-07T20:31:41.2605921Z if contiguous: 2025-05-07T20:31:41.2606008Z x0 = x0.contiguous() 2025-05-07T20:31:41.2606092Z x1 = x1.contiguous() 2025-05-07T20:31:41.2606159Z 2025-05-07T20:31:41.2606244Z if scale_ub is not None: 2025-05-07T20:31:41.2606351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2606486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2606557Z ) 2025-05-07T20:31:41.2606628Z else: 2025-05-07T20:31:41.2606718Z scale_ub_tensor = None 2025-05-07T20:31:41.2606785Z 2025-05-07T20:31:41.2606913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2606995Z op = silu_mul_quant 2025-05-07T20:31:41.2607073Z if compiled: 2025-05-07T20:31:41.2607172Z op = torch.compile(op) 2025-05-07T20:31:41.2607292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2607367Z 2025-05-07T20:31:41.2607474Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2607479Z 2025-05-07T20:31:41.2607576Z moe/activation_test.py:117: 2025-05-07T20:31:41.2607698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2607796Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2607895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2608479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2608571Z 
2025-05-07T20:31:41.2608571Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:41.2613980Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (same jit.py:330 -> jit.py:623 -> compiler.py:273 -> compiler.py:100 frames as above)
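Every CompilationError in this run bottoms out in the same ValueError: Triton only lowers fp8e4nv (PyTorch's float8_e4m3fn) on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner reports SM 8.6. A minimal guard along these lines would skip rather than fail the test on such runners; this is a sketch, not FBGEMM's actual test code, and the helper and class names are invented:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv / float8_e4m3fn needs SM >= 8.9 (Ada or Hopper); the A10G
        # here reports (8, 6), which is why every Triton compile above fails.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU architecture")
    class SiluMulQuantTest(unittest.TestCase):
        ...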
2025-05-07T20:31:41.2614501Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError, reached through an extra torch/_dynamo/eval_frame.py:678 frame from torch.compile: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2627224Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2640192Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError (fp8e4nv not supported)
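For orientation while reading these failures: from the call site, op(x0, x1, scale_ub_tensor) returns a quantized tensor plus a scale, so the operator under test fuses a SiLU gate with fp8 quantization. The following eager sketch of plausible semantics is inferred from the test alone and assumes rowwise float8_e4m3fn quantization; it is not FBGEMM's implementation, and silu_mul_quant_ref is an invented name:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y = torch.nn.functional.silu(x0.float()) * x1.float()  # SiLU gate
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the scale
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)  # rowwise quantize
        return y_fp8, y_scale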
2025-05-07T20:31:41.2658454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:41.2661850Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2663665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:41.2663791Z moe/activation_test.py:95: OutOfMemoryError
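The sizes in these OutOfMemoryError messages are exactly the size of one [T, 2 * D] bfloat16 input tensor from the test body (2 bytes per element), which is what fails to allocate; the roughly 21.9 GiB already in use is what actually fills the card. A quick check of the reported numbers:

    def input_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element
        return T * 2 * D * 2 / 2**20

    assert input_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert input_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert input_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"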
2025-05-07T20:31:41.2663896Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 112.00 MiB with 32.44 MiB free
2025-05-07T20:31:41.2669292Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 144.44 MiB free
2025-05-07T20:31:41.2674479Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free
2025-05-07T20:31:41.2679866Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 32.44 MiB free
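Free memory hovers between roughly 30 and 145 MiB across these examples, meaning the GPU stays essentially full from one Hypothesis example to the next, so allocations from earlier examples are plainly not being released. One plausible mitigation, sketched here with an invented helper name rather than taken from the log, is an explicit cleanup between examples, combined with the allocator setting the error message itself suggests (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which must be set before CUDA is first initialized):

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()               # drop unreachable tensors first
        torch.cuda.empty_cache()   # return cached blocks to the driver
        torch.cuda.synchronize()   # make sure pending frees have landed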
2025-05-07T20:31:41.2685160Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2697648Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2709911Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2722190Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 30.44 MiB free
2025-05-07T20:31:41.2727350Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError (fp8e4nv not supported)
2025-05-07T20:31:41.2739925Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 30.44 MiB free
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2745164Z 2025-05-07T20:31:41.2745279Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:41.2745284Z 2025-05-07T20:31:41.2745381Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2745605Z self=, 2025-05-07T20:31:41.2745679Z T=16384, 2025-05-07T20:31:41.2745750Z D=5120, 2025-05-07T20:31:41.2745829Z scale_ub=None, 2025-05-07T20:31:41.2745908Z contiguous=True, 2025-05-07T20:31:41.2745986Z compiled=False, 2025-05-07T20:31:41.2746055Z ) 2025-05-07T20:31:41.2746269Z self = 2025-05-07T20:31:41.2746445Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2746450Z 2025-05-07T20:31:41.2746525Z @given( 2025-05-07T20:31:41.2746644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2746746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2746854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2746965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2747075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2747144Z ) 2025-05-07T20:31:41.2747384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2747474Z def test_silu_mul_quant( 2025-05-07T20:31:41.2747547Z self, 2025-05-07T20:31:41.2747620Z T: int, 2025-05-07T20:31:41.2747693Z D: int, 2025-05-07T20:31:41.2747784Z scale_ub: Optional[float], 2025-05-07T20:31:41.2747870Z contiguous: bool, 2025-05-07T20:31:41.2747951Z compiled: bool, 2025-05-07T20:31:41.2748022Z ) -> None: 2025-05-07T20:31:41.2748115Z torch.manual_seed(2025) 2025-05-07T20:31:41.2748184Z 2025-05-07T20:31:41.2748353Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2750237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2750243Z 2025-05-07T20:31:41.2750357Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2750362Z 2025-05-07T20:31:41.2750459Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2750676Z self=, 2025-05-07T20:31:41.2750758Z T=4096, 2025-05-07T20:31:41.2750836Z D=5120, 2025-05-07T20:31:41.2750913Z scale_ub=None, 2025-05-07T20:31:41.2750993Z contiguous=True, 2025-05-07T20:31:41.2751073Z compiled=False, 2025-05-07T20:31:41.2751141Z ) 2025-05-07T20:31:41.2751355Z self = 2025-05-07T20:31:41.2751520Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2751525Z 2025-05-07T20:31:41.2751597Z @given( 2025-05-07T20:31:41.2751711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2751806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2751916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2752027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2752135Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2752207Z ) 2025-05-07T20:31:41.2752525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2752621Z def test_silu_mul_quant( 2025-05-07T20:31:41.2752696Z self, 2025-05-07T20:31:41.2752766Z T: int, 2025-05-07T20:31:41.2752834Z D: int, 2025-05-07T20:31:41.2752931Z scale_ub: Optional[float], 2025-05-07T20:31:41.2753014Z contiguous: bool, 2025-05-07T20:31:41.2753096Z compiled: bool, 2025-05-07T20:31:41.2753174Z ) -> None: 2025-05-07T20:31:41.2753261Z torch.manual_seed(2025) 2025-05-07T20:31:41.2753334Z 2025-05-07T20:31:41.2753497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2755269Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
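The OutOfMemoryError examples in this stretch are secondary failures: every message shows the process already holding 22.03 GiB of the card's 22.07 GiB, with 21.73 GiB of live PyTorch allocations carried over from earlier examples, so even a 40 MiB request fails at torch.randn. A hedged sketch of per-example cleanup (illustrative, not taken from activation_test.py) that returns cached blocks before each new example:

import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver


# e.g. call at the top of test_silu_mul_quant, before the first torch.randn
release_cuda_memory()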
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2755283Z 2025-05-07T20:31:41.2755398Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2755403Z 2025-05-07T20:31:41.2755497Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2755718Z self=, 2025-05-07T20:31:41.2755788Z T=2048, 2025-05-07T20:31:41.2755860Z D=5120, 2025-05-07T20:31:41.2755940Z scale_ub=None, 2025-05-07T20:31:41.2756021Z contiguous=False, 2025-05-07T20:31:41.2756097Z compiled=False, 2025-05-07T20:31:41.2756172Z ) 2025-05-07T20:31:41.2756384Z self = 2025-05-07T20:31:41.2756557Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:41.2756562Z 2025-05-07T20:31:41.2756722Z @given( 2025-05-07T20:31:41.2756834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2756934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2757041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2757151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2757261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2757330Z ) 2025-05-07T20:31:41.2757568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2757658Z def test_silu_mul_quant( 2025-05-07T20:31:41.2757730Z self, 2025-05-07T20:31:41.2757802Z T: int, 2025-05-07T20:31:41.2757870Z D: int, 2025-05-07T20:31:41.2757963Z scale_ub: Optional[float], 2025-05-07T20:31:41.2758049Z contiguous: bool, 2025-05-07T20:31:41.2758128Z compiled: bool, 2025-05-07T20:31:41.2758202Z ) -> None: 2025-05-07T20:31:41.2758299Z torch.manual_seed(2025) 2025-05-07T20:31:41.2758372Z 2025-05-07T20:31:41.2758532Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2760303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2760309Z 2025-05-07T20:31:41.2760424Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2760429Z 2025-05-07T20:31:41.2760527Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2760820Z self=, 2025-05-07T20:31:41.2760902Z T=4096, 2025-05-07T20:31:41.2760973Z D=7168, 2025-05-07T20:31:41.2761050Z scale_ub=None, 2025-05-07T20:31:41.2761130Z contiguous=True, 2025-05-07T20:31:41.2761206Z compiled=True, 2025-05-07T20:31:41.2761276Z ) 2025-05-07T20:31:41.2761492Z self = 2025-05-07T20:31:41.2761657Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2761661Z 2025-05-07T20:31:41.2761735Z @given( 2025-05-07T20:31:41.2761850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2761943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2762055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2762166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2762272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2762345Z ) 2025-05-07T20:31:41.2762590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2762680Z def test_silu_mul_quant( 2025-05-07T20:31:41.2762757Z self, 2025-05-07T20:31:41.2762830Z T: int, 2025-05-07T20:31:41.2762900Z D: int, 2025-05-07T20:31:41.2762997Z scale_ub: Optional[float], 2025-05-07T20:31:41.2763080Z contiguous: bool, 2025-05-07T20:31:41.2763158Z compiled: bool, 2025-05-07T20:31:41.2763232Z ) -> None: 2025-05-07T20:31:41.2763454Z torch.manual_seed(2025) 2025-05-07T20:31:41.2763526Z 2025-05-07T20:31:41.2763688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2765467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2765562Z 2025-05-07T20:31:41.2765676Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2765680Z 2025-05-07T20:31:41.2765776Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2765996Z self=, 2025-05-07T20:31:41.2766070Z T=2048, 2025-05-07T20:31:41.2766141Z D=5120, 2025-05-07T20:31:41.2766221Z scale_ub=1200.0, 2025-05-07T20:31:41.2766303Z contiguous=False, 2025-05-07T20:31:41.2766380Z compiled=False, 2025-05-07T20:31:41.2766454Z ) 2025-05-07T20:31:41.2766667Z self = 2025-05-07T20:31:41.2766845Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2766858Z 2025-05-07T20:31:41.2766931Z @given( 2025-05-07T20:31:41.2767043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2767140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2767246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2767355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2767465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2767535Z ) 2025-05-07T20:31:41.2767775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2767866Z def test_silu_mul_quant( 2025-05-07T20:31:41.2767936Z self, 2025-05-07T20:31:41.2768011Z T: int, 2025-05-07T20:31:41.2768083Z D: int, 2025-05-07T20:31:41.2768174Z scale_ub: Optional[float], 2025-05-07T20:31:41.2768259Z contiguous: bool, 2025-05-07T20:31:41.2768338Z compiled: bool, 2025-05-07T20:31:41.2768492Z ) -> None: 2025-05-07T20:31:41.2768591Z torch.manual_seed(2025) 2025-05-07T20:31:41.2768660Z 2025-05-07T20:31:41.2768822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2770587Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
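Each OOM message recommends PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that advice targets fragmentation, and here "reserved by PyTorch but unallocated" is only about 14 MiB, so the setting would likely not rescue this run. For reference, it must be in the environment before the CUDA caching allocator initializes; a sketch:

import os

# Must be set before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the variable so the allocator picks it up

x = torch.empty(1024, device="cuda")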
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2770592Z 2025-05-07T20:31:41.2770702Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2770707Z 2025-05-07T20:31:41.2770810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2771030Z self=, 2025-05-07T20:31:41.2771107Z T=4096, 2025-05-07T20:31:41.2771180Z D=7168, 2025-05-07T20:31:41.2771256Z scale_ub=1200.0, 2025-05-07T20:31:41.2771337Z contiguous=True, 2025-05-07T20:31:41.2771414Z compiled=False, 2025-05-07T20:31:41.2771483Z ) 2025-05-07T20:31:41.2771702Z self = 2025-05-07T20:31:41.2771872Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2771877Z 2025-05-07T20:31:41.2771949Z @given( 2025-05-07T20:31:41.2772065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2772159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2777719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2777865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2777987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2778170Z ) 2025-05-07T20:31:41.2778417Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2778508Z def test_silu_mul_quant( 2025-05-07T20:31:41.2778586Z self, 2025-05-07T20:31:41.2778657Z T: int, 2025-05-07T20:31:41.2778727Z D: int, 2025-05-07T20:31:41.2778821Z scale_ub: Optional[float], 2025-05-07T20:31:41.2778906Z contiguous: bool, 2025-05-07T20:31:41.2778985Z compiled: bool, 2025-05-07T20:31:41.2779061Z ) -> None: 2025-05-07T20:31:41.2779151Z torch.manual_seed(2025) 2025-05-07T20:31:41.2779225Z 2025-05-07T20:31:41.2779393Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2781182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2781198Z 2025-05-07T20:31:41.2781313Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2781319Z 2025-05-07T20:31:41.2781415Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2781641Z self=, 2025-05-07T20:31:41.2781715Z T=16384, 2025-05-07T20:31:41.2781793Z D=7168, 2025-05-07T20:31:41.2781870Z scale_ub=None, 2025-05-07T20:31:41.2781952Z contiguous=False, 2025-05-07T20:31:41.2782029Z compiled=True, 2025-05-07T20:31:41.2782102Z ) 2025-05-07T20:31:41.2782395Z self = 2025-05-07T20:31:41.2782577Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.2782583Z 2025-05-07T20:31:41.2782657Z @given( 2025-05-07T20:31:41.2782773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2782873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2782985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2783096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2783207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2783276Z ) 2025-05-07T20:31:41.2783522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2783609Z def test_silu_mul_quant( 2025-05-07T20:31:41.2783681Z self, 2025-05-07T20:31:41.2783757Z T: int, 2025-05-07T20:31:41.2783831Z D: int, 2025-05-07T20:31:41.2783925Z scale_ub: Optional[float], 2025-05-07T20:31:41.2784017Z contiguous: bool, 2025-05-07T20:31:41.2784100Z compiled: bool, 2025-05-07T20:31:41.2784172Z ) -> None: 2025-05-07T20:31:41.2784274Z torch.manual_seed(2025) 2025-05-07T20:31:41.2784345Z 2025-05-07T20:31:41.2784509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2786282Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
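The requested sizes are consistent with the shapes in the test: a bfloat16 tensor of shape [T, 2*D] occupies T * 2D * 2 bytes, which reproduces the 40, 112, 320, and 448 MiB figures in these messages exactly:

def bf16_mib(T: int, D: int) -> float:
    # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
    return T * 2 * D * 2 / 2**20


assert bf16_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
assert bf16_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
assert bf16_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"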
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2786287Z 2025-05-07T20:31:41.2786401Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2786497Z 2025-05-07T20:31:41.2786602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2786822Z self=, 2025-05-07T20:31:41.2786902Z T=4096, 2025-05-07T20:31:41.2786977Z D=7168, 2025-05-07T20:31:41.2787055Z scale_ub=None, 2025-05-07T20:31:41.2787138Z contiguous=True, 2025-05-07T20:31:41.2787217Z compiled=False, 2025-05-07T20:31:41.2787286Z ) 2025-05-07T20:31:41.2787500Z self = 2025-05-07T20:31:41.2787667Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2787671Z 2025-05-07T20:31:41.2787744Z @given( 2025-05-07T20:31:41.2787860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2787954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2788066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2788182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2788296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2788367Z ) 2025-05-07T20:31:41.2788606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2788694Z def test_silu_mul_quant( 2025-05-07T20:31:41.2788769Z self, 2025-05-07T20:31:41.2788839Z T: int, 2025-05-07T20:31:41.2788907Z D: int, 2025-05-07T20:31:41.2789003Z scale_ub: Optional[float], 2025-05-07T20:31:41.2789087Z contiguous: bool, 2025-05-07T20:31:41.2789165Z compiled: bool, 2025-05-07T20:31:41.2789241Z ) -> None: 2025-05-07T20:31:41.2789329Z torch.manual_seed(2025) 2025-05-07T20:31:41.2789402Z 2025-05-07T20:31:41.2789564Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2791414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2791431Z 2025-05-07T20:31:41.2791547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2791551Z 2025-05-07T20:31:41.2791649Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2791869Z self=, 2025-05-07T20:31:41.2791945Z T=16384, 2025-05-07T20:31:41.2792017Z D=7168, 2025-05-07T20:31:41.2792098Z scale_ub=None, 2025-05-07T20:31:41.2792179Z contiguous=True, 2025-05-07T20:31:41.2792260Z compiled=False, 2025-05-07T20:31:41.2792333Z ) 2025-05-07T20:31:41.2792556Z self = 2025-05-07T20:31:41.2792730Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.2792735Z 2025-05-07T20:31:41.2792809Z @given( 2025-05-07T20:31:41.2792922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2793018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2793126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2793236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2793347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2793416Z ) 2025-05-07T20:31:41.2793663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2793754Z def test_silu_mul_quant( 2025-05-07T20:31:41.2793831Z self, 2025-05-07T20:31:41.2793906Z T: int, 2025-05-07T20:31:41.2793976Z D: int, 2025-05-07T20:31:41.2794073Z scale_ub: Optional[float], 2025-05-07T20:31:41.2794239Z contiguous: bool, 2025-05-07T20:31:41.2794320Z compiled: bool, 2025-05-07T20:31:41.2794390Z ) -> None: 2025-05-07T20:31:41.2794484Z torch.manual_seed(2025) 2025-05-07T20:31:41.2794552Z 2025-05-07T20:31:41.2794716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2796487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2796499Z 2025-05-07T20:31:41.2796615Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2796623Z 2025-05-07T20:31:41.2796721Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2796940Z self=, 2025-05-07T20:31:41.2797019Z T=16384, 2025-05-07T20:31:41.2797094Z D=7168, 2025-05-07T20:31:41.2797173Z scale_ub=1200.0, 2025-05-07T20:31:41.2797258Z contiguous=True, 2025-05-07T20:31:41.2797338Z compiled=False, 2025-05-07T20:31:41.2797408Z ) 2025-05-07T20:31:41.2797622Z self = 2025-05-07T20:31:41.2797794Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2797798Z 2025-05-07T20:31:41.2797872Z @given( 2025-05-07T20:31:41.2797991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2798085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2798275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2798398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2798508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2798584Z ) 2025-05-07T20:31:41.2798824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2798911Z def test_silu_mul_quant( 2025-05-07T20:31:41.2798990Z self, 2025-05-07T20:31:41.2799059Z T: int, 2025-05-07T20:31:41.2799133Z D: int, 2025-05-07T20:31:41.2799226Z scale_ub: Optional[float], 2025-05-07T20:31:41.2799309Z contiguous: bool, 2025-05-07T20:31:41.2799394Z compiled: bool, 2025-05-07T20:31:41.2799468Z ) -> None: 2025-05-07T20:31:41.2799560Z torch.manual_seed(2025) 2025-05-07T20:31:41.2799630Z 2025-05-07T20:31:41.2799794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2801568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
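The @given/@settings pair repeated in these tracebacks is standard Hypothesis usage: sampled_from enumerates a small parameter grid, and the per-test @settings layer on top of the runner's 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True, as the session header further below shows). A sketch of how such a profile could be registered in a conftest (illustrative; FBGEMM's actual conftest may differ):

from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,     # do not persist failing examples between runs
    deadline=None,     # GPU tests can be slow; disable the per-example deadline
    print_blob=True,   # print a reproduction blob for failures
    derandomize=True,  # deterministic example order on CI
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")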
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2801579Z 2025-05-07T20:31:41.2801690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2801694Z 2025-05-07T20:31:41.2801794Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2802014Z self=, 2025-05-07T20:31:41.2802084Z T=128, 2025-05-07T20:31:41.2802162Z D=5120, 2025-05-07T20:31:41.2802239Z scale_ub=1200.0, 2025-05-07T20:31:41.2802321Z contiguous=False, 2025-05-07T20:31:41.2802488Z compiled=False, 2025-05-07T20:31:41.2802559Z ) 2025-05-07T20:31:41.2802770Z self = 2025-05-07T20:31:41.2802941Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:41.2802946Z 2025-05-07T20:31:41.2803018Z @given( 2025-05-07T20:31:41.2803134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2803225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2803440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2803557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2803666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2803735Z ) 2025-05-07T20:31:41.2803976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2804064Z def test_silu_mul_quant( 2025-05-07T20:31:41.2804137Z self, 2025-05-07T20:31:41.2804217Z T: int, 2025-05-07T20:31:41.2804295Z D: int, 2025-05-07T20:31:41.2804387Z scale_ub: Optional[float], 2025-05-07T20:31:41.2804473Z contiguous: bool, 2025-05-07T20:31:41.2804552Z compiled: bool, 2025-05-07T20:31:41.2804625Z ) -> None: 2025-05-07T20:31:41.2804715Z torch.manual_seed(2025) 2025-05-07T20:31:41.2804784Z 2025-05-07T20:31:41.2804949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2805021Z 2025-05-07T20:31:41.2805107Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2805230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2805312Z x = x_sign * x_clamp 2025-05-07T20:31:41.2805385Z x0 = x[:, :D] 2025-05-07T20:31:41.2805463Z x1 = x[:, D:] 2025-05-07T20:31:41.2805534Z 2025-05-07T20:31:41.2805612Z if contiguous: 2025-05-07T20:31:41.2805702Z x0 = x0.contiguous() 2025-05-07T20:31:41.2805901Z x1 = x1.contiguous() 2025-05-07T20:31:41.2805975Z 2025-05-07T20:31:41.2806066Z if scale_ub is not None: 2025-05-07T20:31:41.2806166Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2806298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2806370Z ) 2025-05-07T20:31:41.2806441Z else: 2025-05-07T20:31:41.2806536Z scale_ub_tensor = None 2025-05-07T20:31:41.2806604Z 2025-05-07T20:31:41.2806733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2806819Z op = silu_mul_quant 2025-05-07T20:31:41.2806897Z if compiled: 2025-05-07T20:31:41.2806990Z op = torch.compile(op) 2025-05-07T20:31:41.2807098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2807164Z 2025-05-07T20:31:41.2807250Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2807258Z 2025-05-07T20:31:41.2807353Z moe/activation_test.py:117: 2025-05-07T20:31:41.2807484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2807586Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2807682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2808181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2808278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2808637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2808859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2809200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2809290Z kernel = self.compile( 2025-05-07T20:31:41.2809676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2809936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2810060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2810065Z 2025-05-07T20:31:41.2810487Z self = 2025-05-07T20:31:41.2811262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2811764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150220>} 2025-05-07T20:31:41.2812519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2812716Z context = 2025-05-07T20:31:41.2812720Z 2025-05-07T20:31:41.2812881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2813144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2813249Z module_map=module_map) 2025-05-07T20:31:41.2813409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2813503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2813581Z E ^ 2025-05-07T20:31:41.2813934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2813939Z 2025-05-07T20:31:41.2814355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2814359Z 2025-05-07T20:31:41.2814545Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2814767Z self=, 2025-05-07T20:31:41.2814845Z T=2048, 2025-05-07T20:31:41.2814917Z D=7168, 2025-05-07T20:31:41.2814993Z scale_ub=None, 2025-05-07T20:31:41.2815078Z contiguous=False, 2025-05-07T20:31:41.2815158Z compiled=False, 2025-05-07T20:31:41.2815230Z ) 2025-05-07T20:31:41.2815443Z self = 2025-05-07T20:31:41.2815611Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:41.2815615Z 2025-05-07T20:31:41.2815692Z @given( 2025-05-07T20:31:41.2815806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2815899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2816010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2816122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2816242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2816312Z ) 2025-05-07T20:31:41.2816551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2816643Z def test_silu_mul_quant( 2025-05-07T20:31:41.2816717Z self, 2025-05-07T20:31:41.2816787Z T: int, 2025-05-07T20:31:41.2816863Z D: int, 2025-05-07T20:31:41.2816956Z scale_ub: Optional[float], 2025-05-07T20:31:41.2817039Z contiguous: bool, 2025-05-07T20:31:41.2817122Z compiled: bool, 2025-05-07T20:31:41.2817194Z ) -> None: 2025-05-07T20:31:41.2817283Z torch.manual_seed(2025) 2025-05-07T20:31:41.2817354Z 2025-05-07T20:31:41.2817518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2819307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
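Two failure modes alternate through this run: examples that reach the kernel launch die with the fp8e4nv CompilationError, while examples that fail during setup die with OutOfMemoryError. A hypothetical autouse fixture (not present in the FBGEMM tree) that prints allocator statistics after each test would make interleaved logs like this easier to attribute; note it runs once per test function, not once per Hypothesis example:

import pytest
import torch


@pytest.fixture(autouse=True)
def cuda_memory_report():
    yield
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[mem] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")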
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2819444Z 2025-05-07T20:31:41.2819562Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2819567Z 2025-05-07T20:31:41.2819667Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2819886Z self=, 2025-05-07T20:31:41.2819962Z T=128, 2025-05-07T20:31:41.2820036Z D=7168, 2025-05-07T20:31:41.2820120Z scale_ub=1200.0, 2025-05-07T20:31:41.2820201Z contiguous=True, 2025-05-07T20:31:41.2820280Z compiled=True, 2025-05-07T20:31:41.2820351Z ) 2025-05-07T20:31:41.2820568Z self = 2025-05-07T20:31:41.2820737Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2820741Z 2025-05-07T20:31:41.2820818Z @given( 2025-05-07T20:31:41.2820932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2821027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2821138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2821249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2821362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2821436Z ) 2025-05-07T20:31:41.2821677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2821768Z def test_silu_mul_quant( 2025-05-07T20:31:41.2821840Z self, 2025-05-07T20:31:41.2821911Z T: int, 2025-05-07T20:31:41.2821984Z D: int, 2025-05-07T20:31:41.2822078Z scale_ub: Optional[float], 2025-05-07T20:31:41.2822322Z contiguous: bool, 2025-05-07T20:31:41.2822407Z compiled: bool, 2025-05-07T20:31:41.2822479Z ) -> None: 2025-05-07T20:31:41.2822571Z torch.manual_seed(2025) 2025-05-07T20:31:41.2822643Z 2025-05-07T20:31:41.2822806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2822878Z 2025-05-07T20:31:41.2822963Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2823081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2823170Z x = x_sign * x_clamp 2025-05-07T20:31:41.2823248Z x0 = x[:, :D] 2025-05-07T20:31:41.2823322Z x1 = x[:, D:] 2025-05-07T20:31:41.2823392Z 2025-05-07T20:31:41.2823472Z if contiguous: 2025-05-07T20:31:41.2823559Z x0 = x0.contiguous() 2025-05-07T20:31:41.2823646Z x1 = x1.contiguous() 2025-05-07T20:31:41.2823715Z 2025-05-07T20:31:41.2823801Z if scale_ub is not None: 2025-05-07T20:31:41.2823911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.2824046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.2824123Z ) 2025-05-07T20:31:41.2824195Z else: 2025-05-07T20:31:41.2824288Z scale_ub_tensor = None 2025-05-07T20:31:41.2824361Z 2025-05-07T20:31:41.2824485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.2824571Z op = silu_mul_quant 2025-05-07T20:31:41.2824659Z if compiled: 2025-05-07T20:31:41.2824753Z op = torch.compile(op) 2025-05-07T20:31:41.2824852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2824920Z 2025-05-07T20:31:41.2825009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.2825013Z 2025-05-07T20:31:41.2825105Z moe/activation_test.py:117: 2025-05-07T20:31:41.2825230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2825325Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.2825513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.2825883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.2825974Z return fn(*args, **kwargs) 
2025-05-07T20:31:41.2826471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.2826563Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.2826924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.2827148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.2827487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.2827580Z kernel = self.compile( 2025-05-07T20:31:41.2827968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.2828145Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.2828271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.2828275Z 2025-05-07T20:31:41.2828478Z self = 2025-05-07T20:31:41.2829254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.2829747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08ff150860>} 2025-05-07T20:31:41.2830568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.2830765Z context = 2025-05-07T20:31:41.2830769Z 2025-05-07T20:31:41.2830929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.2831191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.2831293Z module_map=module_map) 2025-05-07T20:31:41.2831448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.2831544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.2831617Z E ^ 2025-05-07T20:31:41.2831971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.2831976Z 2025-05-07T20:31:41.2832393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.2832403Z 2025-05-07T20:31:41.2832502Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2832724Z self=, 2025-05-07T20:31:41.2832799Z T=128, 2025-05-07T20:31:41.2832873Z D=7168, 2025-05-07T20:31:41.2832953Z scale_ub=1200.0, 2025-05-07T20:31:41.2833034Z contiguous=True, 2025-05-07T20:31:41.2833114Z compiled=False, 2025-05-07T20:31:41.2833184Z ) 2025-05-07T20:31:41.2833396Z self = 2025-05-07T20:31:41.2833564Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:41.2833569Z 2025-05-07T20:31:41.2833643Z @given( 2025-05-07T20:31:41.2833756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2833853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2833961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2834182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2834295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2834366Z ) 2025-05-07T20:31:41.2834608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2834703Z def test_silu_mul_quant( 2025-05-07T20:31:41.2834773Z self, 2025-05-07T20:31:41.2834845Z T: int, 2025-05-07T20:31:41.2834926Z D: int, 2025-05-07T20:31:41.2835019Z scale_ub: Optional[float], 2025-05-07T20:31:41.2835103Z contiguous: bool, 2025-05-07T20:31:41.2835190Z compiled: bool, 2025-05-07T20:31:41.2835261Z ) -> None: 2025-05-07T20:31:41.2835352Z torch.manual_seed(2025) 2025-05-07T20:31:41.2835422Z 2025-05-07T20:31:41.2835585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2835656Z 2025-05-07T20:31:41.2835740Z x_sign = torch.sign(x) 2025-05-07T20:31:41.2835863Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.2837649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2837655Z 2025-05-07T20:31:41.2837772Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:41.2837776Z 2025-05-07T20:31:41.2837880Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2838098Z self=, 2025-05-07T20:31:41.2838169Z T=128, 2025-05-07T20:31:41.2838324Z D=5120, 2025-05-07T20:31:41.2838709Z scale_ub=1200.0, 2025-05-07T20:31:41.2838842Z contiguous=True, 2025-05-07T20:31:41.2838921Z compiled=True, 2025-05-07T20:31:41.2838989Z ) 2025-05-07T20:31:41.2839207Z self = 2025-05-07T20:31:41.2839372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.2839377Z 2025-05-07T20:31:41.2839449Z @given( 2025-05-07T20:31:41.2839568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2839663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2839770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2839885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2839994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2840066Z ) 2025-05-07T20:31:41.2840307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2840407Z def test_silu_mul_quant( 2025-05-07T20:31:41.2840486Z self, 2025-05-07T20:31:41.2840558Z T: int, 2025-05-07T20:31:41.2840627Z D: int, 2025-05-07T20:31:41.2840724Z scale_ub: Optional[float], 2025-05-07T20:31:41.2840809Z contiguous: bool, 2025-05-07T20:31:41.2840888Z compiled: bool, 2025-05-07T20:31:41.2840962Z ) -> None: 2025-05-07T20:31:41.2841052Z torch.manual_seed(2025) 2025-05-07T20:31:41.2841122Z 2025-05-07T20:31:41.2841288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2841358Z 2025-05-07T20:31:41.2841446Z > x_sign = torch.sign(x) 2025-05-07T20:31:41.2843210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2843489Z 2025-05-07T20:31:41.2843613Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:41.2843618Z 2025-05-07T20:31:41.2843714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.2843933Z self=, 2025-05-07T20:31:41.2844010Z T=128, 2025-05-07T20:31:41.2844084Z D=7168, 2025-05-07T20:31:41.2844164Z scale_ub=None, 2025-05-07T20:31:41.2844250Z contiguous=True, 2025-05-07T20:31:41.2844329Z compiled=True, 2025-05-07T20:31:41.2844400Z ) 2025-05-07T20:31:41.2844617Z self = 2025-05-07T20:31:41.2844783Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:41.2844793Z 2025-05-07T20:31:41.2844870Z @given( 2025-05-07T20:31:41.2844986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.2845078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.2845190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.2845300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.2845408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.2845481Z ) 2025-05-07T20:31:41.2845720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.2845808Z def test_silu_mul_quant( 2025-05-07T20:31:41.2845884Z self, 2025-05-07T20:31:41.2845958Z T: int, 2025-05-07T20:31:41.2846032Z D: int, 2025-05-07T20:31:41.2846123Z scale_ub: Optional[float], 2025-05-07T20:31:41.2846206Z contiguous: bool, 2025-05-07T20:31:41.2846293Z compiled: bool, 2025-05-07T20:31:41.2846487Z ) -> None: 2025-05-07T20:31:41.2846582Z torch.manual_seed(2025) 2025-05-07T20:31:41.2846653Z 2025-05-07T20:31:41.2846817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.2848579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:41.2848585Z 2025-05-07T20:31:41.2848698Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:41.2848833Z =============================== warnings summary =============================== 2025-05-07T20:31:41.2849147Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2849448Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2849747Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:41.2850619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:41.2850846Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:41.2850853Z 2025-05-07T20:31:41.2851024Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:41.2852291Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:41.2852555Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:41.2852560Z 2025-05-07T20:31:41.2852768Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:41.2852933Z ================== 1 failed, 1 passed, 13 warnings in 22.42s =================== 2025-05-07T20:31:43.1208875Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:43.1874656Z 2025-05-07T20:31:43.1875664Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:43.1876072Z 2025-05-07T20:31:43.1876077Z 2025-05-07T20:31:43.1894013Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:45.3457651Z ============================= test session starts ============================== 2025-05-07T20:31:45.3458355Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:45.3459377Z cachedir: .pytest_cache 2025-05-07T20:31:45.3460516Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:45.3461943Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:45.3462741Z plugins: hypothesis-6.131.14 2025-05-07T20:31:46.9647171Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:47.1167136Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:47.1167540Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:47.1167761Z 2025-05-07T20:31:49.3049781Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.3050904Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:49.3052264Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.3053763Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.3054746Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3056056Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.3057447Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.3058428Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3059665Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.3061401Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.3062467Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3063743Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.3065004Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.3066230Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.3067431Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:49.3068258Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3069281Z W0507 
20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:49.3070300Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:49.3071251Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.3072475Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.3073762Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.3074888Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:49.3075945Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:49.3077120Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.3078482Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.3079539Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.3080502Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.3081245Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:49.3082263Z W0507 20:31:49.303000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.3226065Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.3227135Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:49.3228473Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.3229946Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.3230929Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3232246Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.3233629Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.3234607Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3236017Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.3237413Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.3238702Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3240033Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.3241287Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.3242513Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.3243887Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:49.3244706Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.3245729Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:49.3246748Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:49.3247543Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.3248754Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.3250185Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.3251302Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:49.3252351Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:49.3253537Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.3254890Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.3255946Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.3256858Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.3257596Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:49.3258610Z W0507 20:31:49.321000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:49.8645329Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.8646057Z self=,
2025-05-07T20:31:49.8646462Z T=1,
2025-05-07T20:31:49.8646652Z D=5120,
2025-05-07T20:31:49.8646846Z scale_ub=None,
2025-05-07T20:31:49.8647052Z contiguous=True,
2025-05-07T20:31:49.8647283Z compiled=True,
2025-05-07T20:31:49.8647492Z )
2025-05-07T20:31:49.8647814Z self = 
2025-05-07T20:31:49.8648300Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:49.8648560Z 
2025-05-07T20:31:49.8648649Z     @given(
2025-05-07T20:31:49.8648882Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:49.8649195Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:49.8649502Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:49.8649831Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:49.8650160Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:49.8650456Z     )
2025-05-07T20:31:49.8650810Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:49.8651246Z     def test_silu_mul_quant(
2025-05-07T20:31:49.8651494Z         self,
2025-05-07T20:31:49.8651694Z         T: int,
2025-05-07T20:31:49.8651890Z         D: int,
2025-05-07T20:31:49.8652117Z         scale_ub: Optional[float],
2025-05-07T20:31:49.8652399Z         contiguous: bool,
2025-05-07T20:31:49.8652634Z         compiled: bool,
2025-05-07T20:31:49.8652864Z     ) -> None:
2025-05-07T20:31:49.8653088Z         torch.manual_seed(2025)
2025-05-07T20:31:49.8653333Z 
2025-05-07T20:31:49.8653605Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:49.8655437Z 
2025-05-07T20:31:49.8655638Z         x_sign = torch.sign(x)
2025-05-07T20:31:49.8655926Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:49.8656248Z         x = x_sign * x_clamp
2025-05-07T20:31:49.8657164Z         x0 = x[:, :D]
2025-05-07T20:31:49.8657402Z         x1 = x[:, D:]
2025-05-07T20:31:49.8657608Z 
2025-05-07T20:31:49.8657800Z         if contiguous:
2025-05-07T20:31:49.8658036Z             x0 = x0.contiguous()
2025-05-07T20:31:49.8658290Z             x1 = x1.contiguous()
2025-05-07T20:31:49.8658534Z 
2025-05-07T20:31:49.8658726Z         if scale_ub is not None:
2025-05-07T20:31:49.8658997Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:49.8659334Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:49.8670088Z             )
2025-05-07T20:31:49.8670324Z         else:
2025-05-07T20:31:49.8670548Z             scale_ub_tensor = None
2025-05-07T20:31:49.8670813Z 
2025-05-07T20:31:49.8671051Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.8671379Z             op = silu_mul_quant
2025-05-07T20:31:49.8671639Z             if compiled:
2025-05-07T20:31:49.8671894Z                 op = torch.compile(op)
2025-05-07T20:31:49.8672209Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.8672495Z 
2025-05-07T20:31:49.8672687Z         y_fp8, y_scale = fn()
2025-05-07T20:31:49.8672984Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:49.8673283Z 
2025-05-07T20:31:49.8673531Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.8673865Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:49.8674167Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:49.8674488Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:49.8674843Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.8675164Z 
2025-05-07T20:31:49.8675372Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:49.8675569Z 
2025-05-07T20:31:49.8675672Z moe/activation_test.py:126: 
2025-05-07T20:31:49.8675984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.8676452Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:49.8676795Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.8677591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:49.8678359Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:49.8678919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:49.8679609Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.8680321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:49.8681058Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.8681836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:49.8682600Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.8683523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:49.8685627Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:49.8686248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:49.8686771Z     fn()
2025-05-07T20:31:49.8687296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:49.8687886Z     self.fn.run(
2025-05-07T20:31:49.8688358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.8688898Z     kernel = self.compile(
2025-05-07T20:31:49.8689457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.8690211Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.8690609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.8690847Z 
2025-05-07T20:31:49.8691057Z self = 
2025-05-07T20:31:49.8692147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.8693535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd39c2c7ce0>}
2025-05-07T20:31:49.8694881Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.8695919Z context = 
2025-05-07T20:31:49.8696212Z 
2025-05-07T20:31:49.8696382Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.8696911Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.8697376Z                            module_map=module_map)
2025-05-07T20:31:49.8697751Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.8698118Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:49.8698392Z E       ^
2025-05-07T20:31:49.8698857Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.8699324Z 
2025-05-07T20:31:49.8699825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.8700398Z 
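Note the failure path: with compiled=True the test gets past fn() (torch.compile only logs the W-prefixed warnings while tracing) and dies inside ref_fn, because triton_quantize_fp8_row's autotuner benchmarks each surviving config (autotuner.py:186 -> _bench -> do_bench) and the first benchmark call JIT-compiles _kernel_quantize_fp8_row, which is where make_ir rejects fp8e4nv. Only the elementwise part of the reference is portable; as a plain-PyTorch sketch (no fp8, so it runs on any CUDA GPU), the math being checked is:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Same math as ref_fn above, minus the fp8 row quantization:
        # y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, computed in float32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32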
2025-05-07T20:31:49.8700514Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.8700929Z self=,
2025-05-07T20:31:49.8701341Z T=2048,
2025-05-07T20:31:49.8701539Z D=5120,
2025-05-07T20:31:49.8701742Z scale_ub=1200.0,
2025-05-07T20:31:49.8701965Z contiguous=True,
2025-05-07T20:31:49.8702203Z compiled=False,
2025-05-07T20:31:49.8702418Z )
2025-05-07T20:31:50.4162816Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:50.4196275Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.4197185Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.4198018Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^
2025-05-07T20:31:50.4199040Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.4184163Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:50.4184985Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.4186160Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:50.4187193Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:50.4187990Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.4189206Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.4190490Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.4191619Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:50.4192673Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:50.4193855Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.4195217Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.4196275Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.4197185Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.4198018Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:50.4199040Z W0507 20:31:50.413000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5216773Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.5217835Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:50.5219181Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.5220620Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.5221590Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5222896Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.5224281Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.5225421Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5226668Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.5228043Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.5229101Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5230388Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.5231645Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:50.5232867Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.5234068Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:50.5234899Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.5235924Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:50.5236953Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:50.5237904Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.5239265Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.5240562Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.5241686Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:50.5242735Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:50.5244024Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.5245392Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.5246453Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.5247370Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.5248115Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:50.5249251Z W0507 20:31:50.519000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:50.9724825Z self = 
2025-05-07T20:31:50.9725386Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:50.9725666Z 
2025-05-07T20:31:50.9737372Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:50.9737552Z 
2025-05-07T20:31:50.9737656Z moe/activation_test.py:117: 
2025-05-07T20:31:50.9737961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9738289Z moe/activation_test.py:115: in fn
2025-05-07T20:31:50.9738720Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:50.9739418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:50.9740112Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:50.9740710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:50.9741416Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:50.9742085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:50.9742626Z     kernel = self.compile(
2025-05-07T20:31:50.9743306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:50.9743976Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.9744375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9744616Z 
2025-05-07T20:31:50.9744829Z self = 
2025-05-07T20:31:50.9745917Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:50.9749975Z 
2025-05-07T20:31:50.9750167Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.9750700Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.9751177Z                            module_map=module_map)
2025-05-07T20:31:50.9751543Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.9751903Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.9752167Z E       ^
2025-05-07T20:31:50.9752634Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:50.9753099Z 
2025-05-07T20:31:50.9753525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
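With compiled=False the failure is in silu_mul_quant itself: activation.py:80 launches _fbgemm_silu_mul_quant[grid](...), and a @triton.jit kernel is compiled on its first launch (jit.py run -> self.compile -> make_ir), which is where the same ValueError surfaces. A toy sketch of that launch pattern (a made-up kernel with no fp8 in it, so it also compiles on SM 8.6):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2, mask=mask)

    def double(x: torch.Tensor) -> torch.Tensor:
        # Same kernel[grid](...) launch shape as _fbgemm_silu_mul_quant[grid](...);
        # compilation happens lazily here, on the first call (CUDA tensors required).
        y = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        _double_kernel[grid](x, y, n, BLOCK=1024)
        return y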
2025-05-07T20:31:50.9754278Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.9754694Z self=,
2025-05-07T20:31:50.9755094Z T=2048,
2025-05-07T20:31:50.9755291Z D=5120,
2025-05-07T20:31:50.9755491Z scale_ub=1200.0,
2025-05-07T20:31:50.9755711Z contiguous=True,
2025-05-07T20:31:50.9755936Z compiled=True,
2025-05-07T20:31:50.9756151Z )
2025-05-07T20:31:50.9756468Z self = 
2025-05-07T20:31:50.9756971Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:50.9757250Z 
2025-05-07T20:31:50.9771776Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:50.9771972Z 
2025-05-07T20:31:50.9772177Z moe/activation_test.py:126: 
2025-05-07T20:31:50.9772490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.9772842Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:50.9773171Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:50.9773978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:50.9774749Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:50.9791088Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.9791616Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.9792083Z                            module_map=module_map)
2025-05-07T20:31:50.9792455Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.9792819Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:50.9793082Z E       ^
2025-05-07T20:31:50.9793555Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:50.9794120Z 
2025-05-07T20:31:50.9794541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:50.9795057Z 
2025-05-07T20:31:50.9795170Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.9795581Z self=,
2025-05-07T20:31:50.9795991Z T=16384,
2025-05-07T20:31:50.9796184Z D=7168,
2025-05-07T20:31:50.9796370Z scale_ub=1200.0,
2025-05-07T20:31:50.9796597Z contiguous=False,
2025-05-07T20:31:50.9796829Z compiled=False,
2025-05-07T20:31:50.9797039Z )
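The repeated "Trying example" blocks are Hypothesis at work: with Verbosity.verbose each drawn example is printed before it runs, and st.sampled_from draws combinations from the fixed (T, D, scale_ub, contiguous, compiled) grid until max_examples is exhausted, so one underlying bug gets reported once per sampled combination. A stripped-down sketch of the same pattern (toy test, not the FBGEMM one):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Verbose mode prints "Trying example: test_shapes(T=..., D=...)"
        # for each drawn combination, like the blocks in this log.
        assert T * D > 0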
2025-05-07T20:31:51.2944713Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:51.2977104Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.2978030Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.2978773Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:31:51.2979799Z W0507 20:31:51.292000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:51.3703800Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:51.3736647Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.3737567Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.3738560Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:31:51.3739584Z W0507 20:31:51.367000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0655855Z self = 
2025-05-07T20:31:52.0656466Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:52.0656751Z 
2025-05-07T20:31:52.0668521Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:52.0668694Z 
2025-05-07T20:31:52.0668794Z moe/activation_test.py:117: 
2025-05-07T20:31:52.0669093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:52.0669422Z moe/activation_test.py:115: in fn
2025-05-07T20:31:52.0669713Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:52.0670413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:52.0671112Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:52.0681027Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:52.0681548Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:52.0682021Z                            module_map=module_map)
2025-05-07T20:31:52.0682379Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0682739Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:52.0682999Z E       ^
2025-05-07T20:31:52.0683624Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0684089Z 
2025-05-07T20:31:52.0684514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
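The contiguous flag matters because x0 = x[:, :D] and x1 = x[:, D:] are column slices of one [T, 2 * D] buffer: they share its storage and keep a row stride of 2 * D, so with contiguous=False the kernels receive strided views. For example:

    import torch

    x = torch.randn(4, 16)        # stand-in for the [T, 2 * D] activation buffer
    x0, x1 = x[:, :8], x[:, 8:]   # column slices: same storage, row stride 16
    assert not x0.is_contiguous() and not x1.is_contiguous()
    assert x1.contiguous().is_contiguous()  # the contiguous=True branch copies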
2025-05-07T20:31:52.0685137Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:52.0685547Z self=,
2025-05-07T20:31:52.0685946Z T=1,
2025-05-07T20:31:52.0686130Z D=7168,
2025-05-07T20:31:52.0686324Z scale_ub=None,
2025-05-07T20:31:52.0686536Z contiguous=True,
2025-05-07T20:31:52.0686760Z compiled=True,
2025-05-07T20:31:52.0686968Z )
2025-05-07T20:31:52.0687284Z self = 
2025-05-07T20:31:52.0687773Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:52.0688035Z 
2025-05-07T20:31:52.0702065Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:52.0702257Z 
2025-05-07T20:31:52.0702364Z moe/activation_test.py:126: 
2025-05-07T20:31:52.0702652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:52.0702985Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:52.0703313Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:52.0704099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:52.0704857Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:52.0720978Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:52.0721500Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:52.0721965Z                            module_map=module_map)
2025-05-07T20:31:52.0722322Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0722760Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:52.0723032Z E       ^
2025-05-07T20:31:52.0723574Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0724032Z 
2025-05-07T20:31:52.0724451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:52.0724971Z 
2025-05-07T20:31:52.0725072Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:52.0725482Z self=,
2025-05-07T20:31:52.0725880Z T=4096,
2025-05-07T20:31:52.0726066Z D=5120,
2025-05-07T20:31:52.0726257Z scale_ub=None,
2025-05-07T20:31:52.0726466Z contiguous=False,
2025-05-07T20:31:52.0726694Z compiled=False,
2025-05-07T20:31:52.0726902Z )
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.4433485Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4434723Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.4436121Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.4437206Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4438711Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.4439980Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:31:52.4441202Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.4442596Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:52.4443537Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.4444572Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:52.4445605Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:52.4446404Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.4447632Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.4448932Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.4450068Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:52.4451172Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:52.4452368Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.4453746Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.4454953Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.4455876Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.4456623Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:52.4457663Z W0507 20:31:52.440000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.7113204Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.7114335Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:52.7115698Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.7117173Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.7118159Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7119923Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.7121467Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.7122451Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7123800Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.7125184Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.7126255Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7127546Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.7128802Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:31:52.7130021Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.7131284Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:52.7132126Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.7133335Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:52.7134351Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:52.7135149Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.7136366Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.7137660Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.7139059Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:52.7140103Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:52.7141294Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.7142660Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.7143856Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.7144776Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.7145519Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:52.7146550Z W0507 20:31:52.709000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
self = ..., T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = ..., T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
(test listing identical to the one above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
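For context on what the failing kernels compute: triton_quantize_fp8_row returns a rowwise-quantized FP8 tensor plus one float32 scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract (the scale convention below is an assumption inferred from the test's dequantization step, not FBGEMM's exact implementation):

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One scale per row, chosen so that y / scale fits into float8_e4m3fn.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that an eager version like this should still run on the machine in this log: PyTorch's float8_e4m3fn casts are not architecture-gated, only Triton's lowering of fp8e4nv is.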
Trying example: test_silu_mul_quant(
    self=...,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ..., T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
(identical listing; fn() completes and the failure is at the reference path this time)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126: ... triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=...,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = ..., T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
(identical listing)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
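The repeated "Trying example: test_silu_mul_quant(...)" blocks are Hypothesis output: the test runs with verbosity=Verbosity.verbose, which echoes each drawn example before executing it, and every drawn example here fails the same way. A minimal, self-contained sketch of that reporting style (a hypothetical toy test, not from this suite):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    @given(t=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_demo(t: int) -> None:
        # Each drawn value is echoed as "Trying example: test_demo(t=...)".
        assert t >= 1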
Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = ..., T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
(identical listing)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: ... silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

W0507 20:31:54.114000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:54.114000 87308 [0/4] (same traceback as [0/3] above, ending in CompilationError: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"))
W0507 20:31:54.200000 87308 [0/4] (identical warning and traceback emitted a second time)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.5040916Z self = 2025-05-07T20:31:54.5041673Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:54.5042041Z 2025-05-07T20:31:54.5042171Z @given( 2025-05-07T20:31:54.5042792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.5043123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.5043569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.5043908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.5044254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.5044555Z ) 2025-05-07T20:31:54.5044909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.5045374Z def test_silu_mul_quant( 2025-05-07T20:31:54.5045634Z self, 2025-05-07T20:31:54.5045838Z T: int, 2025-05-07T20:31:54.5046052Z D: int, 2025-05-07T20:31:54.5046291Z scale_ub: Optional[float], 2025-05-07T20:31:54.5046570Z contiguous: bool, 2025-05-07T20:31:54.5046825Z compiled: bool, 2025-05-07T20:31:54.5047072Z ) -> None: 2025-05-07T20:31:54.5047305Z torch.manual_seed(2025) 2025-05-07T20:31:54.5047560Z 2025-05-07T20:31:54.5047857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.5048218Z 2025-05-07T20:31:54.5048423Z x_sign = torch.sign(x) 2025-05-07T20:31:54.5048732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.5049061Z x = x_sign * x_clamp 2025-05-07T20:31:54.5049306Z x0 = x[:, :D] 2025-05-07T20:31:54.5049538Z x1 = x[:, D:] 2025-05-07T20:31:54.5049766Z 2025-05-07T20:31:54.5049957Z if contiguous: 2025-05-07T20:31:54.5050206Z x0 = x0.contiguous() 2025-05-07T20:31:54.5050484Z x1 = x1.contiguous() 2025-05-07T20:31:54.5050728Z 2025-05-07T20:31:54.5050934Z if scale_ub is not None: 2025-05-07T20:31:54.5051222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.5051560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.5051886Z ) 2025-05-07T20:31:54.5052094Z else: 2025-05-07T20:31:54.5052485Z scale_ub_tensor = None 2025-05-07T20:31:54.5052755Z 2025-05-07T20:31:54.5053013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5053354Z op = silu_mul_quant 2025-05-07T20:31:54.5053613Z if compiled: 2025-05-07T20:31:54.5053884Z op = torch.compile(op) 2025-05-07T20:31:54.5054201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5054488Z 2025-05-07T20:31:54.5054702Z y_fp8, y_scale = fn() 2025-05-07T20:31:54.5055007Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:54.5055305Z 2025-05-07T20:31:54.5055555Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5055907Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:54.5056207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:54.5056536Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:54.5056916Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.5057239Z 2025-05-07T20:31:54.5057443Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:54.5057651Z 2025-05-07T20:31:54.5057759Z moe/activation_test.py:126: 2025-05-07T20:31:54.5058068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5058409Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:54.5058752Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.5059560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:54.5060333Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:54.5060886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.5061639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.5062354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:54.5063189Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.5063964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:54.5064733Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.5065485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:54.5066147Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:54.5066775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:54.5067313Z fn() 2025-05-07T20:31:54.5067849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:54.5068451Z self.fn.run( 2025-05-07T20:31:54.5068925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.5069476Z kernel = self.compile( 2025-05-07T20:31:54.5070032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.5070695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.5071112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5071347Z 2025-05-07T20:31:54.5071566Z self = 2025-05-07T20:31:54.5072736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.5074131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3969c3560>} 2025-05-07T20:31:54.5075505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.5076549Z context = 2025-05-07T20:31:54.5076839Z 2025-05-07T20:31:54.5077017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.5077544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.5078024Z module_map=module_map) 2025-05-07T20:31:54.5078400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.5078780Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:54.5079053Z E ^ 2025-05-07T20:31:54.5079531Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5079989Z
2025-05-07T20:31:54.5080422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5080943Z
2025-05-07T20:31:54.5081057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:54.5081475Z     self=,
2025-05-07T20:31:54.5081933Z     T=2048,
2025-05-07T20:31:54.5082135Z     D=5120,
2025-05-07T20:31:54.5082329Z     scale_ub=None,
2025-05-07T20:31:54.5082558Z     contiguous=True,
2025-05-07T20:31:54.5082789Z     compiled=True,
2025-05-07T20:31:54.5082999Z )
2025-05-07T20:31:54.8300455Z W0507 20:31:54.827000 87308 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... full traceback through identify_mutated_tensors -> generate_ttir -> make_ir -> ast_to_ttir, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") for the compiled kernel _fbgemm_silu_mul_quant; the identical warning and traceback were emitted twice ...]
[... test_silu_mul_quant body and CompilationError traceback identical to the T=1 case above, repeated for T=2048 ...]
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... the identical pair of identify_mutated_tensors warnings ([0/6]) emitted again ...]
[... the identical failure repeated for T=128 ...]
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... the identical pair of identify_mutated_tensors warnings ([0/7]) and the identical failure repeated for T=4096 ...]
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:57.0072214Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:57.0073463Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:57.0074795Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:57.0075789Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:57.0077040Z W0507 20:31:57.006000 87308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[... the identical test body repeated for T=16384; the traceback concludes below ...]
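The recompile_limit warning just above is a separate issue from the FP8 failure: x0 = x[:, :D] is a strided view whose row stride is 2*D (10240), while x0.contiguous() has row stride D (5120), so the contiguous and non-contiguous examples keep failing each other's stride guards until torch._dynamo gives up after 8 recompiles and falls back to eager. A small illustration of the two stride patterns; the mitigations in the comments are generic options, not changes made in this run:

import torch

T, D = 4, 5120
x = torch.randn(T, 2 * D)         # a CPU tensor is enough to inspect strides
x0_view = x[:, :D]                # slice shares storage with x
x0_contig = x0_view.contiguous()  # fresh, densely packed copy

print(x0_view.stride())    # (10240, 1) -> the "actual" in the guard failure
print(x0_contig.stride())  # (5120, 1)  -> the "expected" in the guard failure

# Generic mitigations (assumptions, not applied here):
#   torch.compile(op, dynamic=True)             # one graph over shapes/strides
#   torch._dynamo.config.recompile_limit = 64   # or raise the recompile budget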
2025-05-07T20:31:57.0760638Z self = <...>
2025-05-07T20:31:57.0761954Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

2025-05-07T20:31:57.0786207Z moe/activation_test.py:126:
2025-05-07T20:31:57.0786721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.0787318Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.0787896Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.0789605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.0790977Z     _kernel_quantize_fp8_row[grid](
[identical Triton autotune/compile frames and make_ir failure as above]
2025-05-07T20:31:57.0822460Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.0823093Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:57.0823556Z E       ^
2025-05-07T20:31:57.0824377Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.0825928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
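Every failure in this job has the same root cause: Triton's fp8e4nv element type (the float8_e4m3fn layout) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer, and the g5 runner's A10G is sm_86, where Triton offers only fp8e5 and fp8e4b15, exactly as the ValueError reports. A hedged sketch of a capability gate that would skip these examples at collection time; the helper names here are illustrative, not FBGEMM's API:

    # Hypothetical capability gate (names are illustrative, not FBGEMM's).
    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv lowering requires sm_89+ per the Triton error in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not cuda_supports_fp8e4nv(),
        reason="Triton fp8e4nv needs compute capability >= 8.9 (Ada/Hopper)",
    )

Applied as @requires_fp8 on test_silu_mul_quant, the whole Hypothesis matrix would skip on this runner instead of failing example by example.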
2025-05-07T20:31:57.0827241Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:57.1909606Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[same test body as above; this example fails inside fn() itself]
2025-05-07T20:31:57.1941290Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1941763Z moe/activation_test.py:117:
2025-05-07T20:31:57.1942280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1942854Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.1943345Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1944355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:57.1945362Z     return fn(*args, **kwargs)
2025-05-07T20:31:57.1946477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1947603Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1948545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.1950059Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1951262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1952226Z     kernel = self.compile(
2025-05-07T20:31:57.1953205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1954383Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1955091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1965562Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.1966485Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.1967325Z                            module_map=module_map)
2025-05-07T20:31:57.1968192Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1968823Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1969263Z E       ^
2025-05-07T20:31:57.1970094Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1971686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1972808Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.4278518Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[same test body; here fn() returns and the reference path fails instead]
2025-05-07T20:31:57.4294001Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:57.4294304Z moe/activation_test.py:126:
2025-05-07T20:31:57.4294953Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.4295280Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.4296079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.4296846Z     _kernel_quantize_fp8_row[grid](
[identical Triton autotune/compile frames as above]
2025-05-07T20:31:57.4314771Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.4315149Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:57.4315414Z E       ^
2025-05-07T20:31:57.4315883Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.4316762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
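For reference, ref_fn above builds the fp32 SiLU-mul product and then calls triton_quantize_fp8_row, which picks one scale per row so that the row maximum (optionally clamped to scale_ub) maps onto the FP8 range; the test then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager sketch of that row-wise scheme; this is an illustration of the idea, not FBGEMM's kernel, and it assumes this PyTorch build exposes torch.float8_e4m3fn:

    # Eager sketch of row-wise FP8 quantization (illustrative, not
    # _kernel_quantize_fp8_row). FP8_MAX is 448.0 for float8_e4m3fn.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).float()        # per-row maximum
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)    # guard all-zero rows
        y_scale = row_max / FP8_MAX                  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale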
[the next seven examples run the same test body and fail in fn() at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError]
2025-05-07T20:31:57.4317389Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.5531997Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.5569950Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6487790Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:57.6518896Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:57.7909464Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:57.7942013Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
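Because Hypothesis prints each "Trying example" with concrete parameters, any single failure above can be replayed without the property-based driver. A sketch of a standalone reproducer for one of the fn()-side failures; the module path and call signature are taken from the tracebacks in this log, but treat them as an assumption rather than a documented entry point:

    # Standalone replay of one failing example (path per the tracebacks above).
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # the contiguous=False variant
    # On sm_86 this raises the fp8e4nv CompilationError seen throughout this job.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)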
y_scale_ref = ref_fn() 2025-05-07T20:31:58.1597418Z 2025-05-07T20:31:58.1597525Z moe/activation_test.py:126: 2025-05-07T20:31:58.1597830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1598169Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.1598508Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1599307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.1600069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.1600635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.1601417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.1602120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.1602848Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1603748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.1604510Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1605255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.1605899Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.1606522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.1607053Z fn() 2025-05-07T20:31:58.1607568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.1608158Z self.fn.run( 2025-05-07T20:31:58.1608645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.1609188Z kernel = self.compile( 2025-05-07T20:31:58.1609736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.1610404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.1610823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1611057Z 2025-05-07T20:31:58.1611370Z self = 2025-05-07T20:31:58.1612460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.1613859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd3968dede0>} 2025-05-07T20:31:58.1615211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.1616247Z context = 2025-05-07T20:31:58.1616537Z 2025-05-07T20:31:58.1616709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.1617252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.1617786Z module_map=module_map) 2025-05-07T20:31:58.1618164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.1618529Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.1618809Z E ^ 2025-05-07T20:31:58.1619282Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1619740Z 2025-05-07T20:31:58.1620164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.1620691Z 2025-05-07T20:31:58.1620801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1621227Z self=, 2025-05-07T20:31:58.1621642Z T=1, 2025-05-07T20:31:58.1621835Z D=5120, 2025-05-07T20:31:58.1622041Z scale_ub=1200.0, 2025-05-07T20:31:58.1622290Z contiguous=False, 2025-05-07T20:31:58.1622572Z compiled=True, 2025-05-07T20:31:58.1622815Z ) 2025-05-07T20:31:58.2832830Z self = 2025-05-07T20:31:58.2834268Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.2834831Z 2025-05-07T20:31:58.2835004Z @given( 2025-05-07T20:31:58.2835486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.2836121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.2836836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.2837509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.2838168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.2839034Z ) 2025-05-07T20:31:58.2839751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.2840645Z def test_silu_mul_quant( 2025-05-07T20:31:58.2841168Z self, 2025-05-07T20:31:58.2841591Z T: int, 2025-05-07T20:31:58.2841985Z D: int, 2025-05-07T20:31:58.2842437Z scale_ub: Optional[float], 2025-05-07T20:31:58.2842830Z contiguous: bool, 2025-05-07T20:31:58.2843111Z compiled: bool, 2025-05-07T20:31:58.2843452Z ) -> None: 2025-05-07T20:31:58.2843682Z torch.manual_seed(2025) 2025-05-07T20:31:58.2843934Z 2025-05-07T20:31:58.2844214Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.2844572Z 2025-05-07T20:31:58.2844779Z x_sign = torch.sign(x) 2025-05-07T20:31:58.2845076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.2845396Z x = x_sign * x_clamp 2025-05-07T20:31:58.2845647Z x0 = x[:, :D] 2025-05-07T20:31:58.2845869Z x1 = x[:, D:] 2025-05-07T20:31:58.2846086Z 2025-05-07T20:31:58.2846283Z if contiguous: 2025-05-07T20:31:58.2846522Z x0 = x0.contiguous() 2025-05-07T20:31:58.2847099Z x1 = x1.contiguous() 2025-05-07T20:31:58.2847360Z 2025-05-07T20:31:58.2847559Z if scale_ub is not None: 2025-05-07T20:31:58.2847848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.2848194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.2848516Z ) 2025-05-07T20:31:58.2848713Z else: 2025-05-07T20:31:58.2848932Z scale_ub_tensor = None 2025-05-07T20:31:58.2849195Z 2025-05-07T20:31:58.2849431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.2849756Z op = silu_mul_quant 2025-05-07T20:31:58.2850019Z if compiled: 
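Both kernels (_fbgemm_silu_mul_quant above and _kernel_quantize_fp8_row in the reference path) fail during Triton lowering, before any test logic runs: both request the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), and on this runner's GPU, an A10G at compute capability 8.6, the Triton build used here supports only fp8e4b15 and fp8e5, exactly as the ValueError reports; fp8e4nv lowering generally needs an sm_89+ (Ada/Hopper) part. A minimal capability probe, assuming nothing beyond PyTorch (this code is not part of the test or the log):

    # Sketch (not from the log): check whether this GPU can lower Triton's
    # fp8e4nv element type. fp8e4nv corresponds to torch.float8_e4m3fn; the
    # A10G in a g5.4xlarge reports (8, 6), which triggers the ValueError above.
    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        print(f"sm_{major}{minor}: fp8e4nv supported = {(major, minor) >= (8, 9)}")
    else:
        print("no CUDA device visible")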
2025-05-07T20:31:58.2850275Z op = torch.compile(op) 2025-05-07T20:31:58.2850580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2850871Z 2025-05-07T20:31:58.2851067Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.2851243Z 2025-05-07T20:31:58.2851348Z moe/activation_test.py:117: 2025-05-07T20:31:58.2851755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2852096Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.2852388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2852969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.2853548Z return fn(*args, **kwargs) 2025-05-07T20:31:58.2854218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.2854927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.2855485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.2856180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.2856882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.2857537Z kernel = self.compile( 2025-05-07T20:31:58.2858093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.2858780Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2859191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2859426Z 2025-05-07T20:31:58.2859648Z self = 2025-05-07T20:31:58.2860757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.2862161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd396161d00>} 2025-05-07T20:31:58.2863588Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.2864641Z context = 2025-05-07T20:31:58.2864936Z 2025-05-07T20:31:58.2865120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.2865656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2866144Z module_map=module_map) 2025-05-07T20:31:58.2866525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2866889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2867160Z E ^ 2025-05-07T20:31:58.2867776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2868247Z 2025-05-07T20:31:58.2868684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.2869212Z 2025-05-07T20:31:58.2869322Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.2869749Z self=, 2025-05-07T20:31:58.2870164Z T=1, 2025-05-07T20:31:58.2870354Z D=5120, 2025-05-07T20:31:58.2870558Z scale_ub=1200.0, 2025-05-07T20:31:58.2870796Z contiguous=False, 2025-05-07T20:31:58.2871027Z compiled=False, 2025-05-07T20:31:58.2871243Z ) 2025-05-07T20:31:58.2871574Z self = 2025-05-07T20:31:58.2872087Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.2872366Z 2025-05-07T20:31:58.2872450Z @given( 2025-05-07T20:31:58.2872690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.2873077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.2873389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.2873734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.2874079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.2874374Z ) 2025-05-07T20:31:58.2874741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.2875206Z def test_silu_mul_quant( 2025-05-07T20:31:58.2875460Z self, 2025-05-07T20:31:58.2875658Z T: int, 2025-05-07T20:31:58.2875862Z D: int, 2025-05-07T20:31:58.2876090Z scale_ub: Optional[float], 2025-05-07T20:31:58.2876368Z contiguous: bool, 2025-05-07T20:31:58.2876619Z compiled: bool, 2025-05-07T20:31:58.2876857Z ) -> None: 2025-05-07T20:31:58.2877076Z torch.manual_seed(2025) 2025-05-07T20:31:58.2877329Z 2025-05-07T20:31:58.2877622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.2878029Z 2025-05-07T20:31:58.2878236Z x_sign = torch.sign(x) 2025-05-07T20:31:58.2878539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.2878856Z x = x_sign * x_clamp 2025-05-07T20:31:58.2879115Z x0 = x[:, :D] 2025-05-07T20:31:58.2879347Z x1 = x[:, D:] 2025-05-07T20:31:58.2879560Z 2025-05-07T20:31:58.2879761Z if contiguous: 2025-05-07T20:31:58.2880005Z x0 = x0.contiguous() 2025-05-07T20:31:58.2880267Z x1 = x1.contiguous() 2025-05-07T20:31:58.2880520Z 2025-05-07T20:31:58.2880722Z if scale_ub is not None: 2025-05-07T20:31:58.2881006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.2881355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.2881679Z ) 2025-05-07T20:31:58.2881887Z else: 2025-05-07T20:31:58.2882102Z scale_ub_tensor = None 2025-05-07T20:31:58.2882372Z 2025-05-07T20:31:58.2882633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.2882953Z op = silu_mul_quant 2025-05-07T20:31:58.2883276Z if compiled: 2025-05-07T20:31:58.2883532Z op = torch.compile(op) 2025-05-07T20:31:58.2883833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2884117Z 2025-05-07T20:31:58.2884321Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.2884488Z 2025-05-07T20:31:58.2884593Z moe/activation_test.py:117: 2025-05-07T20:31:58.2884899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2885242Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.2885535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.2886236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.2886942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.2887588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.2888287Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.2888975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.2889526Z kernel = self.compile( 2025-05-07T20:31:58.2890087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.2890760Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2891170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.2891403Z 2025-05-07T20:31:58.2891622Z self = 2025-05-07T20:31:58.2892730Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.2894159Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006d8b80>} 2025-05-07T20:31:58.2895536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.2896583Z context = 2025-05-07T20:31:58.2896876Z 2025-05-07T20:31:58.2897054Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.2897591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2898088Z module_map=module_map) 2025-05-07T20:31:58.2898508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2898880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2899145Z E ^ 2025-05-07T20:31:58.2899628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2900089Z 2025-05-07T20:31:58.2900528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.2901057Z 2025-05-07T20:31:58.2901175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.2901599Z self=, 2025-05-07T20:31:58.2902018Z T=16384, 2025-05-07T20:31:58.2902227Z D=5120, 2025-05-07T20:31:58.2902423Z scale_ub=1200.0, 2025-05-07T20:31:58.2902666Z contiguous=False, 2025-05-07T20:31:58.2902909Z compiled=True, 2025-05-07T20:31:58.2903120Z ) 2025-05-07T20:31:58.3601951Z self = 2025-05-07T20:31:58.3602708Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.3603005Z 2025-05-07T20:31:58.3603092Z @given( 2025-05-07T20:31:58.3603452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3603770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3604085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3604421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3604756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3605042Z ) 2025-05-07T20:31:58.3605396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3605848Z def test_silu_mul_quant( 2025-05-07T20:31:58.3606093Z self, 2025-05-07T20:31:58.3606299Z T: int, 2025-05-07T20:31:58.3606509Z D: int, 2025-05-07T20:31:58.3606925Z scale_ub: Optional[float], 2025-05-07T20:31:58.3607220Z contiguous: bool, 2025-05-07T20:31:58.3607466Z compiled: bool, 2025-05-07T20:31:58.3607695Z ) -> None: 2025-05-07T20:31:58.3607926Z torch.manual_seed(2025) 2025-05-07T20:31:58.3608183Z 2025-05-07T20:31:58.3608463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3608815Z 2025-05-07T20:31:58.3609025Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3609318Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3609639Z x = x_sign * x_clamp 2025-05-07T20:31:58.3609889Z x0 = x[:, :D] 2025-05-07T20:31:58.3610114Z x1 = x[:, D:] 2025-05-07T20:31:58.3610323Z 2025-05-07T20:31:58.3610524Z if contiguous: 2025-05-07T20:31:58.3610769Z x0 = x0.contiguous() 2025-05-07T20:31:58.3611031Z x1 = x1.contiguous() 2025-05-07T20:31:58.3611281Z 2025-05-07T20:31:58.3611485Z if scale_ub is not None: 2025-05-07T20:31:58.3611846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3612198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3612516Z ) 2025-05-07T20:31:58.3612712Z else: 2025-05-07T20:31:58.3612935Z scale_ub_tensor = None 2025-05-07T20:31:58.3613195Z 2025-05-07T20:31:58.3613432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3613764Z op = silu_mul_quant 2025-05-07T20:31:58.3614024Z if compiled: 2025-05-07T20:31:58.3614274Z op = torch.compile(op) 2025-05-07T20:31:58.3614579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3614863Z 2025-05-07T20:31:58.3615060Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.3615235Z 2025-05-07T20:31:58.3615338Z moe/activation_test.py:117: 2025-05-07T20:31:58.3615648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3615997Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.3616356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3616928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.3617504Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.3618170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.3618877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.3619426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3620119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3620788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3621338Z kernel = self.compile( 2025-05-07T20:31:58.3621895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3622569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3622975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3623214Z 2025-05-07T20:31:58.3623431Z self = 2025-05-07T20:31:58.3624518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3625892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006d9e40>} 2025-05-07T20:31:58.3627334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3628378Z context = 2025-05-07T20:31:58.3628677Z 2025-05-07T20:31:58.3628846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3629383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3629867Z module_map=module_map) 2025-05-07T20:31:58.3630242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3630608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.3630869Z E ^ 2025-05-07T20:31:58.3631342Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3631815Z 2025-05-07T20:31:58.3632250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.3632824Z 2025-05-07T20:31:58.3632939Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.3633355Z self=, 2025-05-07T20:31:58.3633766Z T=2048, 2025-05-07T20:31:58.3633967Z D=7168, 2025-05-07T20:31:58.3634168Z scale_ub=1200.0, 2025-05-07T20:31:58.3634400Z contiguous=False, 2025-05-07T20:31:58.3634636Z compiled=True, 2025-05-07T20:31:58.3634857Z ) 2025-05-07T20:31:58.3635185Z self = 2025-05-07T20:31:58.3635691Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.3635970Z 2025-05-07T20:31:58.3636060Z @given( 2025-05-07T20:31:58.3636295Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3636620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3636943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3637324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3637666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3637960Z ) 2025-05-07T20:31:58.3638312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3638998Z def test_silu_mul_quant( 2025-05-07T20:31:58.3639513Z self, 2025-05-07T20:31:58.3639765Z T: int, 2025-05-07T20:31:58.3640041Z D: int, 2025-05-07T20:31:58.3640497Z scale_ub: Optional[float], 2025-05-07T20:31:58.3640829Z contiguous: bool, 2025-05-07T20:31:58.3641144Z compiled: bool, 2025-05-07T20:31:58.3652038Z ) -> None: 2025-05-07T20:31:58.3652408Z torch.manual_seed(2025) 2025-05-07T20:31:58.3652826Z 2025-05-07T20:31:58.3653158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3653514Z 2025-05-07T20:31:58.3653719Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3654029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3654353Z x = x_sign * x_clamp 2025-05-07T20:31:58.3654604Z x0 = x[:, :D] 2025-05-07T20:31:58.3654821Z x1 = x[:, D:] 2025-05-07T20:31:58.3655033Z 2025-05-07T20:31:58.3655230Z if contiguous: 2025-05-07T20:31:58.3655464Z x0 = x0.contiguous() 2025-05-07T20:31:58.3655732Z x1 = x1.contiguous() 2025-05-07T20:31:58.3655985Z 2025-05-07T20:31:58.3656176Z if scale_ub is not None: 2025-05-07T20:31:58.3656459Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3656801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3657111Z ) 2025-05-07T20:31:58.3657315Z else: 2025-05-07T20:31:58.3657534Z scale_ub_tensor = None 2025-05-07T20:31:58.3657788Z 2025-05-07T20:31:58.3658032Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3658544Z op = silu_mul_quant 2025-05-07T20:31:58.3658814Z if compiled: 2025-05-07T20:31:58.3659065Z op = torch.compile(op) 2025-05-07T20:31:58.3659369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3659652Z 2025-05-07T20:31:58.3659853Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.3660032Z 2025-05-07T20:31:58.3660137Z moe/activation_test.py:117: 2025-05-07T20:31:58.3660443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3660783Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.3661075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3661652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.3662229Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.3662891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.3663688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.3664241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3664930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3665605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3666151Z kernel = self.compile( 2025-05-07T20:31:58.3666707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3667368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3667777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3668008Z 2025-05-07T20:31:58.3668230Z self = 2025-05-07T20:31:58.3669319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3670756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3006da980>} 2025-05-07T20:31:58.3672116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3673202Z context = 2025-05-07T20:31:58.3673493Z 2025-05-07T20:31:58.3673676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3674204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3674687Z module_map=module_map) 2025-05-07T20:31:58.3675065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3675440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.3675704Z E ^ 2025-05-07T20:31:58.3676181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3676630Z 2025-05-07T20:31:58.3677053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.3677575Z 2025-05-07T20:31:58.4564030Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4564480Z self=, 2025-05-07T20:31:58.4564930Z T=1, 2025-05-07T20:31:58.4565145Z D=5120, 2025-05-07T20:31:58.4565429Z scale_ub=None, 2025-05-07T20:31:58.4565725Z contiguous=False, 2025-05-07T20:31:58.4566252Z compiled=False, 2025-05-07T20:31:58.4566553Z ) 2025-05-07T20:31:58.4567000Z self = 2025-05-07T20:31:58.4567539Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:58.4567809Z 2025-05-07T20:31:58.4567891Z @given( 2025-05-07T20:31:58.4568134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.4568463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.4568777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.4569125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.4569470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.4569771Z ) 2025-05-07T20:31:58.4570127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.4570584Z def test_silu_mul_quant( 2025-05-07T20:31:58.4570837Z self, 2025-05-07T20:31:58.4571041Z T: int, 2025-05-07T20:31:58.4571336Z D: int, 2025-05-07T20:31:58.4571571Z scale_ub: Optional[float], 2025-05-07T20:31:58.4571849Z contiguous: bool, 2025-05-07T20:31:58.4572102Z compiled: bool, 2025-05-07T20:31:58.4572346Z ) -> None: 2025-05-07T20:31:58.4572573Z torch.manual_seed(2025) 2025-05-07T20:31:58.4572829Z 2025-05-07T20:31:58.4573119Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.4573469Z 2025-05-07T20:31:58.4573681Z x_sign = torch.sign(x) 2025-05-07T20:31:58.4573985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.4574302Z x = x_sign * x_clamp 2025-05-07T20:31:58.4574554Z x0 = x[:, :D] 2025-05-07T20:31:58.4574842Z x1 = x[:, D:] 2025-05-07T20:31:58.4575149Z 2025-05-07T20:31:58.4575345Z if contiguous: 2025-05-07T20:31:58.4575593Z x0 = x0.contiguous() 2025-05-07T20:31:58.4575865Z x1 = x1.contiguous() 2025-05-07T20:31:58.4576114Z 2025-05-07T20:31:58.4576406Z if scale_ub is not None: 2025-05-07T20:31:58.4576694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.4577036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.4577355Z ) 2025-05-07T20:31:58.4577557Z else: 2025-05-07T20:31:58.4577772Z scale_ub_tensor = None 2025-05-07T20:31:58.4578032Z 2025-05-07T20:31:58.4578278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.4578594Z op = silu_mul_quant 2025-05-07T20:31:58.4578854Z if compiled: 2025-05-07T20:31:58.4579109Z op = torch.compile(op) 2025-05-07T20:31:58.4579411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4579692Z 2025-05-07T20:31:58.4579897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.4580063Z 2025-05-07T20:31:58.4580174Z moe/activation_test.py:117: 2025-05-07T20:31:58.4580473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4580818Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.4581106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4581799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.4582501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.4583103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.4583798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.4584469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.4585015Z kernel = self.compile( 2025-05-07T20:31:58.4585570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.4586318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.4586733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4586974Z 2025-05-07T20:31:58.4587186Z self = 2025-05-07T20:31:58.4588278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.4589648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b4220>} 2025-05-07T20:31:58.4591006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.4592089Z context = 2025-05-07T20:31:58.4592380Z 2025-05-07T20:31:58.4592556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.4593087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.4593560Z module_map=module_map) 2025-05-07T20:31:58.4593933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.4594300Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.4594561Z E ^ 2025-05-07T20:31:58.4595042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.4595499Z 2025-05-07T20:31:58.4595934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.4596455Z 2025-05-07T20:31:58.4596576Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4597077Z self=, 2025-05-07T20:31:58.4597482Z T=4096, 2025-05-07T20:31:58.4597676Z D=7168, 2025-05-07T20:31:58.4597881Z scale_ub=1200.0, 2025-05-07T20:31:58.4598107Z contiguous=False, 2025-05-07T20:31:58.4598342Z compiled=False, 2025-05-07T20:31:58.4598553Z ) 2025-05-07T20:31:58.4598878Z self = 2025-05-07T20:31:58.4599392Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.4599676Z 2025-05-07T20:31:58.4599764Z @given( 2025-05-07T20:31:58.4600002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.4600318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.4600632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.4600969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.4601306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.4601604Z ) 2025-05-07T20:31:58.4601963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.4602412Z def test_silu_mul_quant( 2025-05-07T20:31:58.4602667Z self, 2025-05-07T20:31:58.4602888Z T: int, 2025-05-07T20:31:58.4603118Z D: int, 2025-05-07T20:31:58.4603497Z scale_ub: Optional[float], 2025-05-07T20:31:58.4603772Z contiguous: bool, 2025-05-07T20:31:58.4604019Z compiled: bool, 2025-05-07T20:31:58.4604250Z ) -> None: 2025-05-07T20:31:58.4604474Z torch.manual_seed(2025) 2025-05-07T20:31:58.4604721Z 2025-05-07T20:31:58.4605004Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.4605353Z 2025-05-07T20:31:58.4605551Z x_sign = torch.sign(x) 2025-05-07T20:31:58.4605849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.4606279Z x = x_sign * x_clamp 2025-05-07T20:31:58.4606524Z x0 = x[:, :D] 2025-05-07T20:31:58.4606749Z x1 = x[:, D:] 2025-05-07T20:31:58.4606967Z 2025-05-07T20:31:58.4607155Z if contiguous: 2025-05-07T20:31:58.4607392Z x0 = x0.contiguous() 2025-05-07T20:31:58.4607657Z x1 = x1.contiguous() 2025-05-07T20:31:58.4607899Z 2025-05-07T20:31:58.4608095Z if scale_ub is not None: 2025-05-07T20:31:58.4608374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.4608717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.4609032Z ) 2025-05-07T20:31:58.4609240Z else: 2025-05-07T20:31:58.4609454Z scale_ub_tensor = None 2025-05-07T20:31:58.4609710Z 2025-05-07T20:31:58.4609948Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.4610267Z op = silu_mul_quant 2025-05-07T20:31:58.4610521Z if compiled: 2025-05-07T20:31:58.4610782Z op = torch.compile(op) 2025-05-07T20:31:58.4611136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4611411Z 2025-05-07T20:31:58.4611611Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.4611777Z 2025-05-07T20:31:58.4611884Z moe/activation_test.py:117: 2025-05-07T20:31:58.4612180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4612519Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.4612807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.4613511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:58.4614209Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.4614765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.4615461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.4616143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.4616736Z kernel = self.compile( 2025-05-07T20:31:58.4617295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.4617972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.4618373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.4618611Z 2025-05-07T20:31:58.4618825Z self = 2025-05-07T20:31:58.4619911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.4621294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b5440>} 2025-05-07T20:31:58.4622658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.4623697Z context = 2025-05-07T20:31:58.4623991Z 2025-05-07T20:31:58.4624161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.4624695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.4625169Z module_map=module_map) 2025-05-07T20:31:58.4625542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.4625909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.4626178Z E ^ 2025-05-07T20:31:58.4626728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.4627197Z 2025-05-07T20:31:58.4627625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.4628145Z 2025-05-07T20:31:58.4628261Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.4628676Z self=, 2025-05-07T20:31:58.4629089Z T=16384, 2025-05-07T20:31:58.4629292Z D=7168, 2025-05-07T20:31:58.4629494Z scale_ub=None, 2025-05-07T20:31:58.4629707Z contiguous=True, 2025-05-07T20:31:58.4629939Z compiled=True, 2025-05-07T20:31:58.4630146Z ) 2025-05-07T20:31:58.8012856Z self = 2025-05-07T20:31:58.8013634Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.8014038Z 2025-05-07T20:31:58.8014150Z @given( 2025-05-07T20:31:58.8014615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8015040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8015414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8015752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8016083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8016371Z ) 2025-05-07T20:31:58.8016726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8017173Z def test_silu_mul_quant( 2025-05-07T20:31:58.8017413Z self, 2025-05-07T20:31:58.8017612Z T: int, 2025-05-07T20:31:58.8017816Z D: int, 2025-05-07T20:31:58.8018034Z scale_ub: Optional[float], 2025-05-07T20:31:58.8018311Z contiguous: bool, 2025-05-07T20:31:58.8018554Z compiled: bool, 2025-05-07T20:31:58.8018782Z ) -> None: 2025-05-07T20:31:58.8019006Z torch.manual_seed(2025) 2025-05-07T20:31:58.8019262Z 2025-05-07T20:31:58.8019625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8019975Z 2025-05-07T20:31:58.8020176Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8020468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8020783Z x = x_sign * x_clamp 2025-05-07T20:31:58.8021030Z x0 = x[:, :D] 2025-05-07T20:31:58.8021254Z x1 = x[:, D:] 2025-05-07T20:31:58.8021462Z 2025-05-07T20:31:58.8021659Z if contiguous: 2025-05-07T20:31:58.8021899Z x0 = x0.contiguous() 2025-05-07T20:31:58.8022159Z x1 = x1.contiguous() 2025-05-07T20:31:58.8022403Z 2025-05-07T20:31:58.8022601Z if scale_ub is not None: 2025-05-07T20:31:58.8022878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8023259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8023593Z ) 2025-05-07T20:31:58.8023788Z else: 2025-05-07T20:31:58.8024011Z scale_ub_tensor = None 2025-05-07T20:31:58.8024269Z 2025-05-07T20:31:58.8024503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8024821Z op = silu_mul_quant 2025-05-07T20:31:58.8025078Z if compiled: 2025-05-07T20:31:58.8025324Z op = torch.compile(op) 2025-05-07T20:31:58.8025625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8025904Z 2025-05-07T20:31:58.8026103Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8026270Z 2025-05-07T20:31:58.8026371Z moe/activation_test.py:117: 2025-05-07T20:31:58.8026669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8027011Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8027290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8027857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8028546Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8029215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8029912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8030622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8031323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8031988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8032527Z kernel = self.compile( 2025-05-07T20:31:58.8033075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8033741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8034145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8034438Z 2025-05-07T20:31:58.8034648Z self = 2025-05-07T20:31:58.8035734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8037108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b6520>} 2025-05-07T20:31:58.8038637Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8039675Z context = 2025-05-07T20:31:58.8039974Z 2025-05-07T20:31:58.8040230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8040761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8041231Z module_map=module_map) 2025-05-07T20:31:58.8041603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8041969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8042230Z E ^ 2025-05-07T20:31:58.8042704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8043355Z 2025-05-07T20:31:58.8043781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8044303Z 2025-05-07T20:31:58.8044415Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8044829Z self=, 2025-05-07T20:31:58.8045253Z T=4096, 2025-05-07T20:31:58.8045449Z D=5120, 2025-05-07T20:31:58.8045644Z scale_ub=None, 2025-05-07T20:31:58.8045869Z contiguous=False, 2025-05-07T20:31:58.8046101Z compiled=True, 2025-05-07T20:31:58.8046311Z ) 2025-05-07T20:31:58.8046659Z self = 2025-05-07T20:31:58.8047160Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:58.8047439Z 2025-05-07T20:31:58.8047522Z @given( 2025-05-07T20:31:58.8047755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8048069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8048383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8048714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8049039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8049333Z ) 2025-05-07T20:31:58.8049809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8050271Z def test_silu_mul_quant( 2025-05-07T20:31:58.8050514Z self, 2025-05-07T20:31:58.8050717Z T: int, 2025-05-07T20:31:58.8050921Z D: int, 2025-05-07T20:31:58.8051143Z scale_ub: Optional[float], 2025-05-07T20:31:58.8051425Z contiguous: bool, 2025-05-07T20:31:58.8051674Z compiled: bool, 2025-05-07T20:31:58.8051900Z ) -> None: 2025-05-07T20:31:58.8052124Z torch.manual_seed(2025) 2025-05-07T20:31:58.8052372Z 2025-05-07T20:31:58.8052649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8053005Z 2025-05-07T20:31:58.8053212Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8053510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8053829Z x = x_sign * x_clamp 2025-05-07T20:31:58.8054074Z x0 = x[:, :D] 2025-05-07T20:31:58.8054300Z x1 = x[:, D:] 2025-05-07T20:31:58.8054588Z 2025-05-07T20:31:58.8054789Z if contiguous: 2025-05-07T20:31:58.8055025Z x0 = x0.contiguous() 2025-05-07T20:31:58.8055291Z x1 = x1.contiguous() 2025-05-07T20:31:58.8055546Z 2025-05-07T20:31:58.8055748Z if scale_ub is not None: 2025-05-07T20:31:58.8056026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8056370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8056683Z ) 2025-05-07T20:31:58.8056874Z else: 2025-05-07T20:31:58.8057091Z scale_ub_tensor = None 2025-05-07T20:31:58.8057347Z 2025-05-07T20:31:58.8057577Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8057892Z op = silu_mul_quant 2025-05-07T20:31:58.8058149Z if compiled: 2025-05-07T20:31:58.8058398Z op = torch.compile(op) 2025-05-07T20:31:58.8058702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8058980Z 2025-05-07T20:31:58.8059182Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8059403Z 2025-05-07T20:31:58.8059505Z moe/activation_test.py:117: 2025-05-07T20:31:58.8059804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8060139Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8060424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8060989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8061559Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8062219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8063038Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8063679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8064539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8073800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8074372Z kernel = self.compile( 2025-05-07T20:31:58.8074922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8075591Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8075991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8076220Z 2025-05-07T20:31:58.8076431Z self = 2025-05-07T20:31:58.8077516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8079009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3004b6c00>} 2025-05-07T20:31:58.8080373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8081405Z context = 2025-05-07T20:31:58.8081694Z 2025-05-07T20:31:58.8081869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8082398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8082870Z module_map=module_map) 2025-05-07T20:31:58.8083323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8083728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8084047Z E ^ 2025-05-07T20:31:58.8084529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8084983Z 2025-05-07T20:31:58.8085404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8085925Z 2025-05-07T20:31:58.9237871Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9238656Z self=, 2025-05-07T20:31:58.9239230Z T=4096, 2025-05-07T20:31:58.9239493Z D=5120, 2025-05-07T20:31:58.9239754Z scale_ub=1200.0, 2025-05-07T20:31:58.9240070Z contiguous=False, 2025-05-07T20:31:58.9240379Z compiled=False, 2025-05-07T20:31:58.9240667Z ) 2025-05-07T20:31:58.9241110Z self = 2025-05-07T20:31:58.9241699Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.9242116Z 2025-05-07T20:31:58.9242205Z @given( 2025-05-07T20:31:58.9242441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9242765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9243079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9243577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9243915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9244206Z ) 2025-05-07T20:31:58.9244562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9245008Z def test_silu_mul_quant( 2025-05-07T20:31:58.9245264Z self, 2025-05-07T20:31:58.9245469Z T: int, 2025-05-07T20:31:58.9245673Z D: int, 2025-05-07T20:31:58.9245899Z scale_ub: Optional[float], 2025-05-07T20:31:58.9246179Z contiguous: bool, 2025-05-07T20:31:58.9246427Z compiled: bool, 2025-05-07T20:31:58.9246661Z ) -> None: 2025-05-07T20:31:58.9246893Z torch.manual_seed(2025) 2025-05-07T20:31:58.9247137Z 2025-05-07T20:31:58.9247416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9247757Z 2025-05-07T20:31:58.9247954Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9248258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9248571Z x = x_sign * x_clamp 2025-05-07T20:31:58.9248812Z x0 = x[:, :D] 2025-05-07T20:31:58.9249043Z x1 = x[:, D:] 2025-05-07T20:31:58.9249258Z 2025-05-07T20:31:58.9249452Z if contiguous: 2025-05-07T20:31:58.9249681Z x0 = x0.contiguous() 2025-05-07T20:31:58.9249944Z x1 = x1.contiguous() 2025-05-07T20:31:58.9250179Z 2025-05-07T20:31:58.9250371Z if scale_ub is not None: 2025-05-07T20:31:58.9250646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9250988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9251430Z ) 2025-05-07T20:31:58.9251635Z else: 2025-05-07T20:31:58.9251853Z scale_ub_tensor = None 2025-05-07T20:31:58.9252111Z 2025-05-07T20:31:58.9252344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9252664Z op = silu_mul_quant 2025-05-07T20:31:58.9252947Z if compiled: 2025-05-07T20:31:58.9253223Z op = torch.compile(op) 2025-05-07T20:31:58.9253529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9253817Z 2025-05-07T20:31:58.9254014Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9254184Z 2025-05-07T20:31:58.9254287Z moe/activation_test.py:117: 2025-05-07T20:31:58.9254586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9254924Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9255207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9255916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:58.9256835Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.9257474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9258305Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9259115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9259759Z kernel = self.compile( 2025-05-07T20:31:58.9260400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9261197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9261655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9261923Z 2025-05-07T20:31:58.9262166Z self = 2025-05-07T20:31:58.9263567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9265304Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b8400>} 2025-05-07T20:31:58.9266995Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9268270Z context = 2025-05-07T20:31:58.9268607Z 2025-05-07T20:31:58.9268794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9269414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9269968Z module_map=module_map) 2025-05-07T20:31:58.9270385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9270784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.9271073Z E ^ 2025-05-07T20:31:58.9271617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9272174Z 2025-05-07T20:31:58.9272686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9273330Z 2025-05-07T20:31:58.9273440Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9273917Z self=, 2025-05-07T20:31:58.9274384Z T=4096, 2025-05-07T20:31:58.9274579Z D=5120, 2025-05-07T20:31:58.9274784Z scale_ub=1200.0, 2025-05-07T20:31:58.9275112Z contiguous=False, 2025-05-07T20:31:58.9275338Z compiled=True, 2025-05-07T20:31:58.9275546Z ) 2025-05-07T20:31:58.9275873Z self = 2025-05-07T20:31:58.9276370Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:58.9276650Z 2025-05-07T20:31:58.9276729Z @given( 2025-05-07T20:31:58.9276961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9277270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9277579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9277906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9278233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9278519Z ) 2025-05-07T20:31:58.9278870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9279314Z def test_silu_mul_quant( 2025-05-07T20:31:58.9279608Z self, 2025-05-07T20:31:58.9279811Z T: int, 2025-05-07T20:31:58.9280012Z D: int, 2025-05-07T20:31:58.9280227Z scale_ub: Optional[float], 2025-05-07T20:31:58.9280501Z contiguous: bool, 2025-05-07T20:31:58.9280743Z compiled: bool, 2025-05-07T20:31:58.9280963Z ) -> None: 2025-05-07T20:31:58.9281185Z torch.manual_seed(2025) 2025-05-07T20:31:58.9281432Z 2025-05-07T20:31:58.9281703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9282048Z 2025-05-07T20:31:58.9282242Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9282541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9282852Z x = x_sign * x_clamp 2025-05-07T20:31:58.9283090Z x0 = x[:, :D] 2025-05-07T20:31:58.9283424Z x1 = x[:, D:] 2025-05-07T20:31:58.9283630Z 2025-05-07T20:31:58.9283818Z if contiguous: 2025-05-07T20:31:58.9284055Z x0 = x0.contiguous() 2025-05-07T20:31:58.9284377Z x1 = x1.contiguous() 2025-05-07T20:31:58.9284624Z 2025-05-07T20:31:58.9284823Z if scale_ub is not None: 2025-05-07T20:31:58.9285094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9285434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9285749Z ) 2025-05-07T20:31:58.9285939Z else: 2025-05-07T20:31:58.9286152Z scale_ub_tensor = None 2025-05-07T20:31:58.9286405Z 2025-05-07T20:31:58.9286636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9286952Z op = silu_mul_quant 2025-05-07T20:31:58.9287204Z if compiled: 2025-05-07T20:31:58.9287455Z op = torch.compile(op) 2025-05-07T20:31:58.9287749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9288028Z 2025-05-07T20:31:58.9288227Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9288391Z 2025-05-07T20:31:58.9288490Z moe/activation_test.py:117: 2025-05-07T20:31:58.9288795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9289130Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9289413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9289981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.9290547Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.9291215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:58.9291904Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:58.9303884Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.9304243Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.9304501Z E ^
2025-05-07T20:31:58.9304972Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.9305902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0185985Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:59.0221473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0222107Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:59.0253421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.0254052Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:59.4164391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.4165022Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:59.4195475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.4196108Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:59.4909918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
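Every one of these Hypothesis examples fails at the same line and for the same reason: the kernel asks Triton to emit an fp8e4nv (e4m3) value, and that encoding is only available on NVIDIA GPUs with compute capability 8.9 or newer, while the GPU running this job only exposes fp8e5 and fp8e4b15, exactly as the ValueError reports. A minimal sketch of a capability-based fallback, assuming a PyTorch build that has the float8 dtypes; pick_fp8_dtype is a hypothetical helper, not part of fbgemm_gpu:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
        # on older GPUs fall back to fp8e5 (torch.float8_e5m2), which the
        # error message above lists as supported on this architecture.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2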
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4909484Z 2025-05-07T20:31:59.4909918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.4910438Z 2025-05-07T20:31:59.4910550Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.4910970Z self=, 2025-05-07T20:31:59.4911386Z T=4096, 2025-05-07T20:31:59.4911576Z D=7168, 2025-05-07T20:31:59.4911777Z scale_ub=None, 2025-05-07T20:31:59.4912007Z contiguous=False, 2025-05-07T20:31:59.4912233Z compiled=True, 2025-05-07T20:31:59.4912451Z ) 2025-05-07T20:31:59.4912780Z self = 2025-05-07T20:31:59.4913275Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.4913560Z 2025-05-07T20:31:59.4913660Z @given( 2025-05-07T20:31:59.4913930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.4914249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.4914555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.4914890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.4915224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.4915507Z ) 2025-05-07T20:31:59.4915861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.4916312Z def test_silu_mul_quant( 2025-05-07T20:31:59.4916552Z self, 2025-05-07T20:31:59.4916752Z T: int, 2025-05-07T20:31:59.4916954Z D: int, 2025-05-07T20:31:59.4917171Z scale_ub: Optional[float], 2025-05-07T20:31:59.4917451Z contiguous: bool, 2025-05-07T20:31:59.4917696Z compiled: bool, 2025-05-07T20:31:59.4917928Z ) -> None: 2025-05-07T20:31:59.4918144Z torch.manual_seed(2025) 2025-05-07T20:31:59.4918918Z 2025-05-07T20:31:59.4919207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.4919550Z 2025-05-07T20:31:59.4919750Z x_sign = torch.sign(x) 2025-05-07T20:31:59.4920053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.4920366Z x = x_sign * x_clamp 2025-05-07T20:31:59.4920613Z x0 = x[:, :D] 2025-05-07T20:31:59.4920840Z x1 = x[:, D:] 2025-05-07T20:31:59.4921049Z 2025-05-07T20:31:59.4921245Z if contiguous: 2025-05-07T20:31:59.4921488Z x0 = x0.contiguous() 2025-05-07T20:31:59.4921743Z x1 = x1.contiguous() 2025-05-07T20:31:59.4921994Z 2025-05-07T20:31:59.4922196Z if scale_ub is not None: 2025-05-07T20:31:59.4922467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.4922814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.4923133Z ) 2025-05-07T20:31:59.4923513Z else: 2025-05-07T20:31:59.4923739Z scale_ub_tensor = None 2025-05-07T20:31:59.4924001Z 2025-05-07T20:31:59.4924240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.4924556Z op = silu_mul_quant 2025-05-07T20:31:59.4924819Z if compiled: 2025-05-07T20:31:59.4925070Z op = torch.compile(op) 2025-05-07T20:31:59.4925365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.4925645Z 2025-05-07T20:31:59.4925846Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.4926014Z 2025-05-07T20:31:59.4926116Z moe/activation_test.py:117: 2025-05-07T20:31:59.4926416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.4926751Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.4927033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.4927603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.4928177Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.4928901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.4929598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.4930147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.4930846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.4931525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.4932063Z kernel = self.compile( 2025-05-07T20:31:59.4932621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.4933292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4933696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.4933942Z 2025-05-07T20:31:59.4934155Z self = 2025-05-07T20:31:59.4935249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.4936632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1fe20>} 2025-05-07T20:31:59.4937989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.4939261Z context = 2025-05-07T20:31:59.4939701Z 2025-05-07T20:31:59.4939878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.4940406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4940880Z module_map=module_map) 2025-05-07T20:31:59.4941243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4941607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4941872Z E ^ 2025-05-07T20:31:59.4942339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4942800Z 2025-05-07T20:31:59.4943222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.4943748Z 2025-05-07T20:31:59.6205870Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.6206364Z self=, 2025-05-07T20:31:59.6207069Z T=16384, 2025-05-07T20:31:59.6207290Z D=5120, 2025-05-07T20:31:59.6207496Z scale_ub=1200.0, 2025-05-07T20:31:59.6207741Z contiguous=False, 2025-05-07T20:31:59.6208046Z compiled=False, 2025-05-07T20:31:59.6208344Z ) 2025-05-07T20:31:59.6208723Z self = 2025-05-07T20:31:59.6209236Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:59.6209536Z 2025-05-07T20:31:59.6209621Z @given( 2025-05-07T20:31:59.6209871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.6210192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.6210516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.6210861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.6211203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.6211499Z ) 2025-05-07T20:31:59.6211869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.6212440Z def test_silu_mul_quant( 2025-05-07T20:31:59.6212683Z self, 2025-05-07T20:31:59.6212894Z T: int, 2025-05-07T20:31:59.6213109Z D: int, 2025-05-07T20:31:59.6213332Z scale_ub: Optional[float], 2025-05-07T20:31:59.6213666Z contiguous: bool, 2025-05-07T20:31:59.6213918Z compiled: bool, 2025-05-07T20:31:59.6214151Z ) -> None: 2025-05-07T20:31:59.6214380Z torch.manual_seed(2025) 2025-05-07T20:31:59.6214633Z 2025-05-07T20:31:59.6214915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.6215272Z 2025-05-07T20:31:59.6215472Z x_sign = torch.sign(x) 2025-05-07T20:31:59.6215767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.6216075Z x = x_sign * x_clamp 2025-05-07T20:31:59.6216323Z x0 = x[:, :D] 2025-05-07T20:31:59.6216544Z x1 = x[:, D:] 2025-05-07T20:31:59.6216764Z 2025-05-07T20:31:59.6216957Z if contiguous: 2025-05-07T20:31:59.6217195Z x0 = x0.contiguous() 2025-05-07T20:31:59.6217451Z x1 = x1.contiguous() 2025-05-07T20:31:59.6217699Z 2025-05-07T20:31:59.6217893Z if scale_ub is not None: 2025-05-07T20:31:59.6218183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.6218526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.6218846Z ) 2025-05-07T20:31:59.6219042Z else: 2025-05-07T20:31:59.6219259Z scale_ub_tensor = None 2025-05-07T20:31:59.6219521Z 2025-05-07T20:31:59.6219757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.6220084Z op = silu_mul_quant 2025-05-07T20:31:59.6220349Z if compiled: 2025-05-07T20:31:59.6220595Z op = torch.compile(op) 2025-05-07T20:31:59.6220898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6221339Z 2025-05-07T20:31:59.6221545Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.6221712Z 2025-05-07T20:31:59.6221813Z moe/activation_test.py:117: 2025-05-07T20:31:59.6222115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6222456Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.6222740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6223443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:59.6224168Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.6224713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.6235289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.6236014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.6236677Z kernel = self.compile( 2025-05-07T20:31:59.6237247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.6237918Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.6238317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6238866Z 2025-05-07T20:31:59.6239083Z self = 2025-05-07T20:31:59.6240168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.6241552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb38d60>} 2025-05-07T20:31:59.6243002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.6244124Z context = 2025-05-07T20:31:59.6244421Z 2025-05-07T20:31:59.6244594Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.6245121Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.6245588Z module_map=module_map) 2025-05-07T20:31:59.6245964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.6246330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.6246586Z E ^ 2025-05-07T20:31:59.6247061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.6247525Z 2025-05-07T20:31:59.6247951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.6248467Z 2025-05-07T20:31:59.6248578Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.6248988Z self=, 2025-05-07T20:31:59.6249398Z T=16384, 2025-05-07T20:31:59.6249595Z D=5120, 2025-05-07T20:31:59.6249785Z scale_ub=1200.0, 2025-05-07T20:31:59.6250016Z contiguous=True, 2025-05-07T20:31:59.6250239Z compiled=True, 2025-05-07T20:31:59.6250442Z ) 2025-05-07T20:31:59.6250769Z self = 2025-05-07T20:31:59.6251266Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.6251543Z 2025-05-07T20:31:59.6251626Z @given( 2025-05-07T20:31:59.6251849Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.6252296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.6252616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.6252944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.6253331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.6253626Z ) 2025-05-07T20:31:59.6253971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.6254420Z def test_silu_mul_quant( 2025-05-07T20:31:59.6254669Z self, 2025-05-07T20:31:59.6254869Z T: int, 2025-05-07T20:31:59.6255066Z D: int, 2025-05-07T20:31:59.6255290Z scale_ub: Optional[float], 2025-05-07T20:31:59.6255567Z contiguous: bool, 2025-05-07T20:31:59.6255806Z compiled: bool, 2025-05-07T20:31:59.6256037Z ) -> None: 2025-05-07T20:31:59.6256259Z torch.manual_seed(2025) 2025-05-07T20:31:59.6256495Z 2025-05-07T20:31:59.6256783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.6257203Z 2025-05-07T20:31:59.6257394Z x_sign = torch.sign(x) 2025-05-07T20:31:59.6257696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.6258011Z x = x_sign * x_clamp 2025-05-07T20:31:59.6258252Z x0 = x[:, :D] 2025-05-07T20:31:59.6258482Z x1 = x[:, D:] 2025-05-07T20:31:59.6258706Z 2025-05-07T20:31:59.6258897Z if contiguous: 2025-05-07T20:31:59.6259135Z x0 = x0.contiguous() 2025-05-07T20:31:59.6259403Z x1 = x1.contiguous() 2025-05-07T20:31:59.6259638Z 2025-05-07T20:31:59.6259837Z if scale_ub is not None: 2025-05-07T20:31:59.6260115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.6260444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.6260758Z ) 2025-05-07T20:31:59.6260955Z else: 2025-05-07T20:31:59.6261161Z scale_ub_tensor = None 2025-05-07T20:31:59.6261418Z 2025-05-07T20:31:59.6261658Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.6262014Z op = silu_mul_quant 2025-05-07T20:31:59.6262261Z if compiled: 2025-05-07T20:31:59.6262506Z op = torch.compile(op) 2025-05-07T20:31:59.6262798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6263079Z 2025-05-07T20:31:59.6263273Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.6263435Z 2025-05-07T20:31:59.6263534Z moe/activation_test.py:117: 2025-05-07T20:31:59.6263830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6264196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.6264495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.6265053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.6265616Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.6266282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.6266974Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.6267514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.6268196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.6268870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.6269399Z kernel = self.compile( 2025-05-07T20:31:59.6269946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.6270608Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.6271003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.6271231Z 2025-05-07T20:31:59.6271531Z self = 2025-05-07T20:31:59.6272617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.6273985Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3a200>} 2025-05-07T20:31:59.6275335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.6276354Z context = 2025-05-07T20:31:59.6276652Z 2025-05-07T20:31:59.6276819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.6277391Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.6277861Z module_map=module_map) 2025-05-07T20:31:59.6278220Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.6278581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.6278843Z E ^ 2025-05-07T20:31:59.6279306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.6279768Z 2025-05-07T20:31:59.6280187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.6280716Z 2025-05-07T20:31:59.7586541Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.7587016Z self=, 2025-05-07T20:31:59.7587427Z T=16384, 2025-05-07T20:31:59.7587631Z D=5120, 2025-05-07T20:31:59.7587876Z scale_ub=None, 2025-05-07T20:31:59.7588409Z contiguous=False, 2025-05-07T20:31:59.7588636Z compiled=True, 2025-05-07T20:31:59.7588866Z ) 2025-05-07T20:31:59.7589194Z self = 2025-05-07T20:31:59.7589694Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.7589973Z 2025-05-07T20:31:59.7590053Z @given( 2025-05-07T20:31:59.7590291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.7590608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.7590908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.7591241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.7591569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.7591848Z ) 2025-05-07T20:31:59.7592200Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.7592648Z def test_silu_mul_quant( 2025-05-07T20:31:59.7592896Z self, 2025-05-07T20:31:59.7593089Z T: int, 2025-05-07T20:31:59.7593288Z D: int, 2025-05-07T20:31:59.7593511Z scale_ub: Optional[float], 2025-05-07T20:31:59.7593776Z contiguous: bool, 2025-05-07T20:31:59.7594019Z compiled: bool, 2025-05-07T20:31:59.7594250Z ) -> None: 2025-05-07T20:31:59.7594467Z torch.manual_seed(2025) 2025-05-07T20:31:59.7594711Z 2025-05-07T20:31:59.7594987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.7595324Z 2025-05-07T20:31:59.7595521Z x_sign = torch.sign(x) 2025-05-07T20:31:59.7595813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.7596120Z x = x_sign * x_clamp 2025-05-07T20:31:59.7596364Z x0 = x[:, :D] 2025-05-07T20:31:59.7596583Z x1 = x[:, D:] 2025-05-07T20:31:59.7596784Z 2025-05-07T20:31:59.7596976Z if contiguous: 2025-05-07T20:31:59.7597364Z x0 = x0.contiguous() 2025-05-07T20:31:59.7597631Z x1 = x1.contiguous() 2025-05-07T20:31:59.7597875Z 2025-05-07T20:31:59.7598074Z if scale_ub is not None: 2025-05-07T20:31:59.7598353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.7598685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.7598998Z ) 2025-05-07T20:31:59.7599196Z else: 2025-05-07T20:31:59.7599400Z scale_ub_tensor = None 2025-05-07T20:31:59.7599654Z 2025-05-07T20:31:59.7599890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.7600202Z op = silu_mul_quant 2025-05-07T20:31:59.7600455Z if compiled: 2025-05-07T20:31:59.7600709Z op = torch.compile(op) 2025-05-07T20:31:59.7601003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.7601279Z 2025-05-07T20:31:59.7601476Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.7601638Z 2025-05-07T20:31:59.7601834Z moe/activation_test.py:117: 2025-05-07T20:31:59.7602134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.7602473Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.7602765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.7603424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.7603994Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.7604658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.7605340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.7605879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.7606575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.7607248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.7607828Z kernel = self.compile( 2025-05-07T20:31:59.7608372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.7609055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.7609448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.7609686Z 2025-05-07T20:31:59.7609899Z self = 2025-05-07T20:31:59.7610985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.7612376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3ad40>} 2025-05-07T20:31:59.7613751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.7614809Z context = 2025-05-07T20:31:59.7615102Z 2025-05-07T20:31:59.7615269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.7615795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.7616258Z module_map=module_map) 2025-05-07T20:31:59.7616626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.7616984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.7617246Z E ^ 2025-05-07T20:31:59.7617820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.7618284Z 2025-05-07T20:31:59.7618704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.7619220Z 2025-05-07T20:31:59.7619332Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.7619746Z self=, 2025-05-07T20:31:59.7620155Z T=2048, 2025-05-07T20:31:59.7620347Z D=5120, 2025-05-07T20:31:59.7620543Z scale_ub=None, 2025-05-07T20:31:59.7620753Z contiguous=False, 2025-05-07T20:31:59.7620977Z compiled=True, 2025-05-07T20:31:59.7621179Z ) 2025-05-07T20:32:00.0310814Z self = 2025-05-07T20:32:00.0312274Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:00.0312824Z 2025-05-07T20:32:00.0312987Z @given( 2025-05-07T20:32:00.0313296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0313758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0314070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0314396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0314730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0315018Z ) 2025-05-07T20:32:00.0315376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0315817Z def test_silu_mul_quant( 2025-05-07T20:32:00.0316063Z self, 2025-05-07T20:32:00.0316267Z T: int, 2025-05-07T20:32:00.0316464Z D: int, 2025-05-07T20:32:00.0316688Z scale_ub: Optional[float], 2025-05-07T20:32:00.0316965Z contiguous: bool, 2025-05-07T20:32:00.0317204Z compiled: bool, 2025-05-07T20:32:00.0317433Z ) -> None: 2025-05-07T20:32:00.0317654Z torch.manual_seed(2025) 2025-05-07T20:32:00.0317896Z 2025-05-07T20:32:00.0318180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0318612Z 2025-05-07T20:32:00.0318806Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0319106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0319426Z x = x_sign * x_clamp 2025-05-07T20:32:00.0319662Z x0 = x[:, :D] 2025-05-07T20:32:00.0319885Z x1 = x[:, D:] 2025-05-07T20:32:00.0320099Z 2025-05-07T20:32:00.0320284Z if contiguous: 2025-05-07T20:32:00.0320521Z x0 = x0.contiguous() 2025-05-07T20:32:00.0320786Z x1 = x1.contiguous() 2025-05-07T20:32:00.0321031Z 2025-05-07T20:32:00.0321224Z if scale_ub is not None: 2025-05-07T20:32:00.0321501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0321840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0322145Z ) 2025-05-07T20:32:00.0322345Z else: 2025-05-07T20:32:00.0322562Z scale_ub_tensor = None 2025-05-07T20:32:00.0322821Z 2025-05-07T20:32:00.0323063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0323555Z op = silu_mul_quant 2025-05-07T20:32:00.0323803Z if compiled: 2025-05-07T20:32:00.0324055Z op = torch.compile(op) 2025-05-07T20:32:00.0324355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0324627Z 2025-05-07T20:32:00.0324822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.0324993Z 2025-05-07T20:32:00.0325099Z moe/activation_test.py:117: 2025-05-07T20:32:00.0325403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0325733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.0326019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0326593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:00.0327155Z return fn(*args, **kwargs) 
2025-05-07T20:32:00.0327964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0328668Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0329213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0329894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0330562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0331101Z kernel = self.compile( 2025-05-07T20:32:00.0331644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0332308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0332708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0332986Z 2025-05-07T20:32:00.0333204Z self = 2025-05-07T20:32:00.0334280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0335661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfc6c7c0>} 2025-05-07T20:32:00.0337016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0338051Z context = 2025-05-07T20:32:00.0338338Z 2025-05-07T20:32:00.0338688Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0339290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0339765Z module_map=module_map) 2025-05-07T20:32:00.0340139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0340494Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0340762Z E ^ 2025-05-07T20:32:00.0341236Z E ValueError("type fp8e4nv not supported in this architecture. 
The next nine Hypothesis examples fail identically -- compilation of _fbgemm_silu_mul_quant raises the same CompilationError (ValueError: type fp8e4nv not supported in this architecture) from triton/compiler/compiler.py:100, with and without torch.compile; only the drawn parameters differ:
2025-05-07T20:32:00.0342739Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.1702808Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:32:00.1748247Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.2606069Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:32:00.3601838Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError
2025-05-07T20:32:00.3633882Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:32:00.7033673Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.7067260Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:32:00.7830074Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError
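The remaining examples, shown next, stop before the kernel is even reached: they hit CUDA OOM while building the test input. The failed sizes match the test's own allocations: x is [T, 2 * D] bfloat16, so T=16384, D=7168 gives 16384 * 14336 * 2 B = 448 MiB, exactly the failed torch.randn allocation below, and torch.abs/torch.clamp each materialize one more tensor of the same size. A sketch of the two mitigations the error text itself suggests (where these would live in the test harness is an assumption):

import gc
import os

import torch

# From the error message: let the caching allocator grow segments instead of
# fragmenting. Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # Hypothetical helper (not in the test file): drop dead references and
    # return cached, unused blocks to the driver between Hypothesis examples.
    gc.collect()
    torch.cuda.empty_cache()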
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7861457Z 2025-05-07T20:32:00.7861885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7862400Z 2025-05-07T20:32:00.8504476Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8505144Z self=, 2025-05-07T20:32:00.8505555Z T=16384, 2025-05-07T20:32:00.8505753Z D=5120, 2025-05-07T20:32:00.8505950Z scale_ub=None, 2025-05-07T20:32:00.8506163Z contiguous=False, 2025-05-07T20:32:00.8506393Z compiled=False, 2025-05-07T20:32:00.8506596Z ) 2025-05-07T20:32:00.8506919Z self = 2025-05-07T20:32:00.8507413Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.8507695Z 2025-05-07T20:32:00.8507776Z @given( 2025-05-07T20:32:00.8508004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8508567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8508873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8509201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8509524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8509813Z ) 2025-05-07T20:32:00.8510163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8510603Z def test_silu_mul_quant( 2025-05-07T20:32:00.8510840Z self, 2025-05-07T20:32:00.8511039Z T: int, 2025-05-07T20:32:00.8511242Z D: int, 2025-05-07T20:32:00.8511452Z scale_ub: Optional[float], 2025-05-07T20:32:00.8511728Z contiguous: bool, 2025-05-07T20:32:00.8511969Z compiled: bool, 2025-05-07T20:32:00.8512190Z ) -> None: 2025-05-07T20:32:00.8512410Z torch.manual_seed(2025) 2025-05-07T20:32:00.8512657Z 2025-05-07T20:32:00.8512933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8513289Z 2025-05-07T20:32:00.8513486Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8513774Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8515799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8517687Z 2025-05-07T20:32:00.8517808Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8518025Z 2025-05-07T20:32:00.8518132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8518714Z self=, 2025-05-07T20:32:00.8519123Z T=4096, 2025-05-07T20:32:00.8519317Z D=7168, 2025-05-07T20:32:00.8519513Z scale_ub=1200.0, 2025-05-07T20:32:00.8519734Z contiguous=True, 2025-05-07T20:32:00.8519963Z compiled=True, 2025-05-07T20:32:00.8520174Z ) 2025-05-07T20:32:00.8520502Z self = 2025-05-07T20:32:00.8520995Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.8521277Z 2025-05-07T20:32:00.8521359Z @given( 2025-05-07T20:32:00.8521593Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8521899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8522209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8522545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8522881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8523379Z ) 2025-05-07T20:32:00.8523737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8524182Z def test_silu_mul_quant( 2025-05-07T20:32:00.8524416Z self, 2025-05-07T20:32:00.8524623Z T: int, 2025-05-07T20:32:00.8524836Z D: int, 2025-05-07T20:32:00.8525055Z scale_ub: Optional[float], 2025-05-07T20:32:00.8525344Z contiguous: bool, 2025-05-07T20:32:00.8525593Z compiled: bool, 2025-05-07T20:32:00.8525821Z ) -> None: 2025-05-07T20:32:00.8526051Z torch.manual_seed(2025) 2025-05-07T20:32:00.8526305Z 2025-05-07T20:32:00.8526579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8526930Z 2025-05-07T20:32:00.8527137Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8527434Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8529455Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8531389Z 2025-05-07T20:32:00.8531513Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8531741Z 2025-05-07T20:32:00.8531849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8532278Z self=, 2025-05-07T20:32:00.8532687Z T=16384, 2025-05-07T20:32:00.8532897Z D=7168, 2025-05-07T20:32:00.8533104Z scale_ub=None, 2025-05-07T20:32:00.8533318Z contiguous=False, 2025-05-07T20:32:00.8533594Z compiled=False, 2025-05-07T20:32:00.8533824Z ) 2025-05-07T20:32:00.8534138Z self = 2025-05-07T20:32:00.8534643Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.8534929Z 2025-05-07T20:32:00.8535008Z @given( 2025-05-07T20:32:00.8535252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8535563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8535872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8536202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8536527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8536820Z ) 2025-05-07T20:32:00.8537172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8537618Z def test_silu_mul_quant( 2025-05-07T20:32:00.8537854Z self, 2025-05-07T20:32:00.8538178Z T: int, 2025-05-07T20:32:00.8538653Z D: int, 2025-05-07T20:32:00.8538873Z scale_ub: Optional[float], 2025-05-07T20:32:00.8539149Z contiguous: bool, 2025-05-07T20:32:00.8539394Z compiled: bool, 2025-05-07T20:32:00.8539617Z ) -> None: 2025-05-07T20:32:00.8539839Z torch.manual_seed(2025) 2025-05-07T20:32:00.8540084Z 2025-05-07T20:32:00.8540354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8542418Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8544368Z 2025-05-07T20:32:00.8544486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.8544703Z 2025-05-07T20:32:00.8544807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8545292Z self=, 2025-05-07T20:32:00.8556246Z T=2048, 2025-05-07T20:32:00.8556452Z D=7168, 2025-05-07T20:32:00.8556642Z scale_ub=1200.0, 2025-05-07T20:32:00.8556879Z contiguous=True, 2025-05-07T20:32:00.8557108Z compiled=True, 2025-05-07T20:32:00.8557310Z ) 2025-05-07T20:32:00.8557640Z self = 2025-05-07T20:32:00.8558146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.8558418Z 2025-05-07T20:32:00.8558506Z @given( 2025-05-07T20:32:00.8558737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8559069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8559504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8559829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8560158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8560445Z ) 2025-05-07T20:32:00.8560785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8561217Z def test_silu_mul_quant( 2025-05-07T20:32:00.8561464Z self, 2025-05-07T20:32:00.8561658Z T: int, 2025-05-07T20:32:00.8561862Z D: int, 2025-05-07T20:32:00.8562087Z scale_ub: Optional[float], 2025-05-07T20:32:00.8562357Z contiguous: bool, 2025-05-07T20:32:00.8562602Z compiled: bool, 2025-05-07T20:32:00.8562835Z ) -> None: 2025-05-07T20:32:00.8563058Z torch.manual_seed(2025) 2025-05-07T20:32:00.8563435Z 2025-05-07T20:32:00.8563717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8564067Z 2025-05-07T20:32:00.8564253Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8564552Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8566550Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
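Note that in each message nearly all of the ~22 GiB is "allocated by PyTorch" (around 21.7 GiB) rather than merely reserved, which points at live tensors surviving across hypothesis examples; traceback frames captured for earlier failing examples are one common way such references stay alive. A minimal mitigation sketch, assuming the growth really is cross-example garbage rather than a leak inside the op itself:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical placement in this TestCase
        def tearDown(self) -> None:
            gc.collect()              # drop example-local tensors still referenced by frames
            torch.cuda.empty_cache()  # return cached blocks so the next example can allocate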
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.8568405Z 2025-05-07T20:32:00.8568526Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.8568739Z 2025-05-07T20:32:00.8568849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8569383Z self=, 2025-05-07T20:32:00.8569798Z T=2048, 2025-05-07T20:32:00.8569995Z D=7168, 2025-05-07T20:32:00.8570183Z scale_ub=None, 2025-05-07T20:32:00.8570400Z contiguous=True, 2025-05-07T20:32:00.8570629Z compiled=False, 2025-05-07T20:32:00.8570831Z ) 2025-05-07T20:32:00.9431780Z self = 2025-05-07T20:32:00.9433104Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.9433662Z 2025-05-07T20:32:00.9433819Z @given( 2025-05-07T20:32:00.9434157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.9434467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.9434776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.9435104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.9435429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.9435990Z ) 2025-05-07T20:32:00.9436354Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.9436794Z def test_silu_mul_quant( 2025-05-07T20:32:00.9437037Z self, 2025-05-07T20:32:00.9437234Z T: int, 2025-05-07T20:32:00.9437431Z D: int, 2025-05-07T20:32:00.9437659Z scale_ub: Optional[float], 2025-05-07T20:32:00.9437932Z contiguous: bool, 2025-05-07T20:32:00.9438166Z compiled: bool, 2025-05-07T20:32:00.9438662Z ) -> None: 2025-05-07T20:32:00.9438889Z torch.manual_seed(2025) 2025-05-07T20:32:00.9439129Z 2025-05-07T20:32:00.9439410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.9439751Z 2025-05-07T20:32:00.9439948Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.9441891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
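The error text suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that setting targets fragmentation, i.e. large "reserved by PyTorch but unallocated" numbers; here that figure peaks around 141 MiB, so it is unlikely to rescue these runs. If tried, it must be set before the first CUDA allocation, e.g. in the job environment rather than inside the test:

    # Shell, before launching pytest (illustrative):
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")  # before CUDA init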
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.9443939Z 2025-05-07T20:32:00.9444057Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.9444273Z 2025-05-07T20:32:00.9444375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.9444786Z self=, 2025-05-07T20:32:00.9445186Z T=1, 2025-05-07T20:32:00.9445369Z D=7168, 2025-05-07T20:32:00.9445562Z scale_ub=1200.0, 2025-05-07T20:32:00.9445779Z contiguous=True, 2025-05-07T20:32:00.9446005Z compiled=False, 2025-05-07T20:32:00.9446217Z ) 2025-05-07T20:32:00.9446538Z self = 2025-05-07T20:32:00.9447024Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.9447296Z 2025-05-07T20:32:00.9447375Z @given( 2025-05-07T20:32:00.9447605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.9447907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.9448217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.9448546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.9448867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.9449151Z ) 2025-05-07T20:32:00.9449501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.9449941Z def test_silu_mul_quant( 2025-05-07T20:32:00.9450175Z self, 2025-05-07T20:32:00.9450374Z T: int, 2025-05-07T20:32:00.9450571Z D: int, 2025-05-07T20:32:00.9450785Z scale_ub: Optional[float], 2025-05-07T20:32:00.9451228Z contiguous: bool, 2025-05-07T20:32:00.9451476Z compiled: bool, 2025-05-07T20:32:00.9451694Z ) -> None: 2025-05-07T20:32:00.9451912Z torch.manual_seed(2025) 2025-05-07T20:32:00.9452154Z 2025-05-07T20:32:00.9452425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.9452772Z 2025-05-07T20:32:00.9452974Z x_sign = torch.sign(x) 2025-05-07T20:32:00.9453261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.9453575Z x = x_sign * x_clamp 2025-05-07T20:32:00.9453818Z x0 = x[:, :D] 2025-05-07T20:32:00.9454028Z x1 = x[:, D:] 2025-05-07T20:32:00.9454243Z 2025-05-07T20:32:00.9454432Z if contiguous: 2025-05-07T20:32:00.9454665Z x0 = x0.contiguous() 2025-05-07T20:32:00.9454928Z x1 = x1.contiguous() 2025-05-07T20:32:00.9455171Z 2025-05-07T20:32:00.9455368Z if scale_ub is not None: 2025-05-07T20:32:00.9455717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.9456055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.9456365Z ) 2025-05-07T20:32:00.9456556Z else: 2025-05-07T20:32:00.9456768Z scale_ub_tensor = None 2025-05-07T20:32:00.9457023Z 2025-05-07T20:32:00.9457253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.9457565Z op = silu_mul_quant 2025-05-07T20:32:00.9457812Z if compiled: 2025-05-07T20:32:00.9458056Z op = torch.compile(op) 2025-05-07T20:32:00.9458354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.9458635Z 2025-05-07T20:32:00.9458826Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.9458995Z 2025-05-07T20:32:00.9459095Z moe/activation_test.py:117: 2025-05-07T20:32:00.9459394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.9459728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.9460014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.9460756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.9461454Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.9461990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.9462678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.9463351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.9463888Z kernel = self.compile( 2025-05-07T20:32:00.9464430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.9465095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.9465542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.9465870Z 2025-05-07T20:32:00.9466173Z self = 2025-05-07T20:32:00.9467363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.9468744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf661440>} 2025-05-07T20:32:00.9470096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.9471128Z context = 2025-05-07T20:32:00.9471525Z 2025-05-07T20:32:00.9471695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.9472216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.9472685Z module_map=module_map) 2025-05-07T20:32:00.9473050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.9473402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.9473664Z E ^ 2025-05-07T20:32:00.9474131Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.9474636Z 2025-05-07T20:32:00.9475372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.9475898Z 2025-05-07T20:32:00.9476002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.9476925Z self=, 2025-05-07T20:32:00.9477618Z T=128, 2025-05-07T20:32:00.9477849Z D=5120, 2025-05-07T20:32:00.9478063Z scale_ub=None, 2025-05-07T20:32:00.9478278Z contiguous=True, 2025-05-07T20:32:00.9478505Z compiled=False, 2025-05-07T20:32:00.9478710Z ) 2025-05-07T20:32:01.0026597Z self = 2025-05-07T20:32:01.0027361Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0027635Z 2025-05-07T20:32:01.0027723Z @given( 2025-05-07T20:32:01.0027950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0028265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0028575Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0028899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0029231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0029520Z ) 2025-05-07T20:32:01.0029910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0030612Z def test_silu_mul_quant( 2025-05-07T20:32:01.0030857Z self, 2025-05-07T20:32:01.0031061Z T: int, 2025-05-07T20:32:01.0031261Z D: int, 2025-05-07T20:32:01.0031483Z scale_ub: Optional[float], 2025-05-07T20:32:01.0031760Z contiguous: bool, 2025-05-07T20:32:01.0031996Z compiled: bool, 2025-05-07T20:32:01.0032224Z ) -> None: 2025-05-07T20:32:01.0032440Z torch.manual_seed(2025) 2025-05-07T20:32:01.0032677Z 2025-05-07T20:32:01.0032948Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0033295Z 2025-05-07T20:32:01.0033481Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0033775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0034109Z x = x_sign * x_clamp 2025-05-07T20:32:01.0034368Z x0 = x[:, :D] 2025-05-07T20:32:01.0034588Z x1 = x[:, D:] 2025-05-07T20:32:01.0034810Z 2025-05-07T20:32:01.0034998Z if contiguous: 2025-05-07T20:32:01.0035228Z x0 = x0.contiguous() 2025-05-07T20:32:01.0035490Z x1 = x1.contiguous() 2025-05-07T20:32:01.0035731Z 2025-05-07T20:32:01.0035921Z if scale_ub is not None: 2025-05-07T20:32:01.0036200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0036538Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0036840Z ) 2025-05-07T20:32:01.0037035Z else: 2025-05-07T20:32:01.0037246Z scale_ub_tensor = None 2025-05-07T20:32:01.0037492Z 2025-05-07T20:32:01.0037729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0038046Z op = silu_mul_quant 2025-05-07T20:32:01.0038293Z if compiled: 2025-05-07T20:32:01.0038824Z op = torch.compile(op) 2025-05-07T20:32:01.0039125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0039586Z 2025-05-07T20:32:01.0039788Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0039959Z 2025-05-07T20:32:01.0040060Z moe/activation_test.py:117: 2025-05-07T20:32:01.0040358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0040685Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0040967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0041662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0042347Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0042891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0043700Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0044372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0044997Z kernel = self.compile( 2025-05-07T20:32:01.0045545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0046206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0046601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0046837Z 2025-05-07T20:32:01.0047048Z self = 2025-05-07T20:32:01.0048124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0049499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf662660>} 2025-05-07T20:32:01.0050914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0051933Z context = 2025-05-07T20:32:01.0052225Z 2025-05-07T20:32:01.0052393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0052918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0053384Z module_map=module_map) 2025-05-07T20:32:01.0053744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0054097Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0054362Z E ^ 2025-05-07T20:32:01.0054821Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0055285Z 2025-05-07T20:32:01.0055707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0056228Z 2025-05-07T20:32:01.0056331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0056742Z self=, 2025-05-07T20:32:01.0057137Z T=128, 2025-05-07T20:32:01.0057328Z D=7168, 2025-05-07T20:32:01.0057525Z scale_ub=None, 2025-05-07T20:32:01.0057732Z contiguous=True, 2025-05-07T20:32:01.0057956Z compiled=False, 2025-05-07T20:32:01.0058164Z ) 2025-05-07T20:32:01.0058478Z self = 2025-05-07T20:32:01.0058967Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0059241Z 2025-05-07T20:32:01.0059320Z @given( 2025-05-07T20:32:01.0059547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0059943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0060256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0060584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0060905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0061193Z ) 2025-05-07T20:32:01.0061542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0061976Z def test_silu_mul_quant( 2025-05-07T20:32:01.0062221Z self, 2025-05-07T20:32:01.0062417Z T: int, 2025-05-07T20:32:01.0062609Z D: int, 2025-05-07T20:32:01.0062828Z scale_ub: Optional[float], 2025-05-07T20:32:01.0063103Z contiguous: bool, 2025-05-07T20:32:01.0063335Z compiled: bool, 2025-05-07T20:32:01.0063572Z ) -> None: 2025-05-07T20:32:01.0063828Z torch.manual_seed(2025) 2025-05-07T20:32:01.0064078Z 2025-05-07T20:32:01.0064345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0064799Z 2025-05-07T20:32:01.0064996Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0065283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0065598Z x = x_sign * x_clamp 2025-05-07T20:32:01.0065844Z x0 = x[:, :D] 2025-05-07T20:32:01.0066055Z x1 = x[:, D:] 2025-05-07T20:32:01.0066273Z 2025-05-07T20:32:01.0066463Z if contiguous: 2025-05-07T20:32:01.0066695Z x0 = x0.contiguous() 2025-05-07T20:32:01.0066957Z x1 = x1.contiguous() 2025-05-07T20:32:01.0067200Z 2025-05-07T20:32:01.0067388Z if scale_ub is not None: 2025-05-07T20:32:01.0067661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0067999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0068304Z ) 2025-05-07T20:32:01.0068498Z else: 2025-05-07T20:32:01.0068709Z scale_ub_tensor = None 2025-05-07T20:32:01.0068964Z 2025-05-07T20:32:01.0069199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0069565Z op = silu_mul_quant 2025-05-07T20:32:01.0069817Z if compiled: 2025-05-07T20:32:01.0070061Z op = torch.compile(op) 2025-05-07T20:32:01.0070362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0070637Z 2025-05-07T20:32:01.0070825Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0070997Z 2025-05-07T20:32:01.0071096Z moe/activation_test.py:117: 2025-05-07T20:32:01.0071393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0071720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0072003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0072703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0073401Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0073947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0074640Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0075320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0075856Z kernel = self.compile( 2025-05-07T20:32:01.0076402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0077072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0077472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0077701Z 2025-05-07T20:32:01.0077910Z self = 2025-05-07T20:32:01.0079076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0080457Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf6636a0>} 2025-05-07T20:32:01.0081812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0082846Z context = 2025-05-07T20:32:01.0083136Z 2025-05-07T20:32:01.0083421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0083953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0084422Z module_map=module_map) 2025-05-07T20:32:01.0084835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0085202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0085468Z E ^ 2025-05-07T20:32:01.0085942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0086400Z 2025-05-07T20:32:01.0086822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0087343Z 2025-05-07T20:32:01.0087448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0087865Z self=, 2025-05-07T20:32:01.0088273Z T=2048, 2025-05-07T20:32:01.0088462Z D=7168, 2025-05-07T20:32:01.0088656Z scale_ub=1200.0, 2025-05-07T20:32:01.0088881Z contiguous=True, 2025-05-07T20:32:01.0089100Z compiled=False, 2025-05-07T20:32:01.0089310Z ) 2025-05-07T20:32:01.0760735Z self = 2025-05-07T20:32:01.0762472Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.0763026Z 2025-05-07T20:32:01.0763188Z @given( 2025-05-07T20:32:01.0763692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0764049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0764355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0764677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0765009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0765297Z ) 2025-05-07T20:32:01.0765642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0766089Z def test_silu_mul_quant( 2025-05-07T20:32:01.0766332Z self, 2025-05-07T20:32:01.0766531Z T: int, 2025-05-07T20:32:01.0766728Z D: int, 2025-05-07T20:32:01.0766953Z scale_ub: Optional[float], 2025-05-07T20:32:01.0767234Z contiguous: bool, 2025-05-07T20:32:01.0767467Z compiled: bool, 2025-05-07T20:32:01.0767693Z ) -> None: 2025-05-07T20:32:01.0767913Z torch.manual_seed(2025) 2025-05-07T20:32:01.0768147Z 2025-05-07T20:32:01.0768424Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0770485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
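The CompilationError traces interleaved above are a separate failure mode from the OOMs: fp8e4nv is Triton's FP8 E4M3 type, which Triton supports only on GPUs of compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly as the ValueError lists. A guard sketch (the helper name is illustrative, not part of the test):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs sm_89+ (Ada/Hopper and newer)
        return torch.cuda.get_device_capability() >= (8, 9)

    # Inside test_silu_mul_quant, before invoking silu_mul_quant:
    # if not supports_fp8e4nv():
    #     raise unittest.SkipTest("Triton fp8e4nv requires compute capability >= 8.9")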
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.0772349Z 2025-05-07T20:32:01.0772646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.0772860Z 2025-05-07T20:32:01.0772969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0773375Z self=, 2025-05-07T20:32:01.0773842Z T=1, 2025-05-07T20:32:01.0774036Z D=5120, 2025-05-07T20:32:01.0774227Z scale_ub=1200.0, 2025-05-07T20:32:01.0774457Z contiguous=True, 2025-05-07T20:32:01.0774682Z compiled=False, 2025-05-07T20:32:01.0774884Z ) 2025-05-07T20:32:01.0775206Z self = 2025-05-07T20:32:01.0775699Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.0775966Z 2025-05-07T20:32:01.0776047Z @given( 2025-05-07T20:32:01.0776278Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0776593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0776902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0777323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0777660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0777952Z ) 2025-05-07T20:32:01.0778293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0778737Z def test_silu_mul_quant( 2025-05-07T20:32:01.0778981Z self, 2025-05-07T20:32:01.0779168Z T: int, 2025-05-07T20:32:01.0779366Z D: int, 2025-05-07T20:32:01.0779585Z scale_ub: Optional[float], 2025-05-07T20:32:01.0790100Z contiguous: bool, 2025-05-07T20:32:01.0790387Z compiled: bool, 2025-05-07T20:32:01.0790623Z ) -> None: 2025-05-07T20:32:01.0790848Z torch.manual_seed(2025) 2025-05-07T20:32:01.0791089Z 2025-05-07T20:32:01.0791379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0791733Z 2025-05-07T20:32:01.0791926Z x_sign = torch.sign(x) 2025-05-07T20:32:01.0792238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.0792639Z x = x_sign * x_clamp 2025-05-07T20:32:01.0792886Z x0 = x[:, :D] 2025-05-07T20:32:01.0793104Z x1 = x[:, D:] 2025-05-07T20:32:01.0793316Z 2025-05-07T20:32:01.0793522Z if contiguous: 2025-05-07T20:32:01.0793793Z x0 = x0.contiguous() 2025-05-07T20:32:01.0794058Z x1 = x1.contiguous() 2025-05-07T20:32:01.0794308Z 2025-05-07T20:32:01.0794495Z if scale_ub is not None: 2025-05-07T20:32:01.0794780Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.0795125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.0795435Z ) 2025-05-07T20:32:01.0795642Z else: 2025-05-07T20:32:01.0795860Z scale_ub_tensor = None 2025-05-07T20:32:01.0796117Z 2025-05-07T20:32:01.0796356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.0796678Z op = silu_mul_quant 2025-05-07T20:32:01.0796934Z if compiled: 2025-05-07T20:32:01.0797187Z op = torch.compile(op) 2025-05-07T20:32:01.0797489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0797760Z 2025-05-07T20:32:01.0797959Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.0798132Z 2025-05-07T20:32:01.0798239Z moe/activation_test.py:117: 2025-05-07T20:32:01.0798542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0798874Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.0799160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.0799860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.0800547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.0801094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.0801870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.0802553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.0803088Z kernel = self.compile( 2025-05-07T20:32:01.0803732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.0804392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0804780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.0805018Z 2025-05-07T20:32:01.0805229Z self = 2025-05-07T20:32:01.0806316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.0807749Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf454b80>} 2025-05-07T20:32:01.0809099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.0810119Z context = 2025-05-07T20:32:01.0810420Z 2025-05-07T20:32:01.0810589Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.0811118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0811592Z module_map=module_map) 2025-05-07T20:32:01.0811952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0812322Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0812634Z E ^ 2025-05-07T20:32:01.0813097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.0813556Z 2025-05-07T20:32:01.0813976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0814488Z 2025-05-07T20:32:01.0814599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0815003Z self=, 2025-05-07T20:32:01.0815406Z T=2048, 2025-05-07T20:32:01.0815601Z D=5120, 2025-05-07T20:32:01.0815788Z scale_ub=None, 2025-05-07T20:32:01.0816007Z contiguous=True, 2025-05-07T20:32:01.0816235Z compiled=False, 2025-05-07T20:32:01.0816432Z ) 2025-05-07T20:32:01.0816751Z self = 2025-05-07T20:32:01.0817253Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.0817526Z 2025-05-07T20:32:01.0817599Z @given( 2025-05-07T20:32:01.0817830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.0818142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.0818446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.0818764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.0819091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.0819377Z ) 2025-05-07T20:32:01.0819721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.0820163Z def test_silu_mul_quant( 2025-05-07T20:32:01.0820407Z self, 2025-05-07T20:32:01.0820597Z T: int, 2025-05-07T20:32:01.0820793Z D: int, 2025-05-07T20:32:01.0821011Z scale_ub: Optional[float], 2025-05-07T20:32:01.0821275Z contiguous: bool, 2025-05-07T20:32:01.0821512Z compiled: bool, 2025-05-07T20:32:01.0821820Z ) -> None: 2025-05-07T20:32:01.0822032Z torch.manual_seed(2025) 2025-05-07T20:32:01.0822273Z 2025-05-07T20:32:01.0822543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.0822888Z 2025-05-07T20:32:01.0823073Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.0825018Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.0826869Z 2025-05-07T20:32:01.0826987Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.0827251Z 2025-05-07T20:32:01.0827361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.0827768Z self=, 2025-05-07T20:32:01.0828170Z T=16384, 2025-05-07T20:32:01.0828367Z D=5120, 2025-05-07T20:32:01.0828559Z scale_ub=None, 2025-05-07T20:32:01.0828766Z contiguous=True, 2025-05-07T20:32:01.0828992Z compiled=False, 2025-05-07T20:32:01.0829196Z ) 2025-05-07T20:32:01.1531017Z self = 2025-05-07T20:32:01.1532578Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.1533283Z 2025-05-07T20:32:01.1533457Z @given( 2025-05-07T20:32:01.1533814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1534124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1534437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1534798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1535369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1535664Z ) 2025-05-07T20:32:01.1536022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1536479Z def test_silu_mul_quant( 2025-05-07T20:32:01.1536720Z self, 2025-05-07T20:32:01.1536922Z T: int, 2025-05-07T20:32:01.1537124Z D: int, 2025-05-07T20:32:01.1537343Z scale_ub: Optional[float], 2025-05-07T20:32:01.1537620Z contiguous: bool, 2025-05-07T20:32:01.1537861Z compiled: bool, 2025-05-07T20:32:01.1538091Z ) -> None: 2025-05-07T20:32:01.1538309Z torch.manual_seed(2025) 2025-05-07T20:32:01.1538866Z 2025-05-07T20:32:01.1539135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1541186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1543066Z 2025-05-07T20:32:01.1543184Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1543405Z 2025-05-07T20:32:01.1543509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1543975Z self=, 2025-05-07T20:32:01.1544367Z T=4096, 2025-05-07T20:32:01.1544559Z D=5120, 2025-05-07T20:32:01.1544758Z scale_ub=None, 2025-05-07T20:32:01.1544967Z contiguous=True, 2025-05-07T20:32:01.1545196Z compiled=False, 2025-05-07T20:32:01.1545407Z ) 2025-05-07T20:32:01.1545888Z self = 2025-05-07T20:32:01.1546386Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.1546665Z 2025-05-07T20:32:01.1546743Z @given( 2025-05-07T20:32:01.1546972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1547280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1547587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1547918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1548242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1548529Z ) 2025-05-07T20:32:01.1548880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1549318Z def test_silu_mul_quant( 2025-05-07T20:32:01.1549565Z self, 2025-05-07T20:32:01.1549765Z T: int, 2025-05-07T20:32:01.1549958Z D: int, 2025-05-07T20:32:01.1550265Z scale_ub: Optional[float], 2025-05-07T20:32:01.1550544Z contiguous: bool, 2025-05-07T20:32:01.1550775Z compiled: bool, 2025-05-07T20:32:01.1551001Z ) -> None: 2025-05-07T20:32:01.1551219Z torch.manual_seed(2025) 2025-05-07T20:32:01.1551466Z 2025-05-07T20:32:01.1551734Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1553771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1555622Z 2025-05-07T20:32:01.1555821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1556036Z 2025-05-07T20:32:01.1556148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1556556Z self=, 2025-05-07T20:32:01.1556964Z T=2048, 2025-05-07T20:32:01.1557152Z D=5120, 2025-05-07T20:32:01.1557348Z scale_ub=None, 2025-05-07T20:32:01.1557557Z contiguous=False, 2025-05-07T20:32:01.1557785Z compiled=False, 2025-05-07T20:32:01.1557989Z ) 2025-05-07T20:32:01.1558302Z self = 2025-05-07T20:32:01.1558798Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.1559069Z 2025-05-07T20:32:01.1559155Z @given( 2025-05-07T20:32:01.1559377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1559688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1559999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1560327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1560669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1560964Z ) 2025-05-07T20:32:01.1561308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1561754Z def test_silu_mul_quant( 2025-05-07T20:32:01.1562004Z self, 2025-05-07T20:32:01.1562200Z T: int, 2025-05-07T20:32:01.1562405Z D: int, 2025-05-07T20:32:01.1562625Z scale_ub: Optional[float], 2025-05-07T20:32:01.1562890Z contiguous: bool, 2025-05-07T20:32:01.1563135Z compiled: bool, 2025-05-07T20:32:01.1563462Z ) -> None: 2025-05-07T20:32:01.1563673Z torch.manual_seed(2025) 2025-05-07T20:32:01.1563918Z 2025-05-07T20:32:01.1564199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1566350Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1568201Z 2025-05-07T20:32:01.1568328Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1568538Z 2025-05-07T20:32:01.1568639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1569055Z self=, 2025-05-07T20:32:01.1569458Z T=4096, 2025-05-07T20:32:01.1569640Z D=7168, 2025-05-07T20:32:01.1569829Z scale_ub=None, 2025-05-07T20:32:01.1570044Z contiguous=True, 2025-05-07T20:32:01.1570321Z compiled=True, 2025-05-07T20:32:01.1570522Z ) 2025-05-07T20:32:01.1570841Z self = 2025-05-07T20:32:01.1571332Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.1571599Z 2025-05-07T20:32:01.1571676Z @given( 2025-05-07T20:32:01.1571906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1572215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1572516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1572845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1573175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1573458Z ) 2025-05-07T20:32:01.1573809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1574252Z def test_silu_mul_quant( 2025-05-07T20:32:01.1574502Z self, 2025-05-07T20:32:01.1574699Z T: int, 2025-05-07T20:32:01.1574947Z D: int, 2025-05-07T20:32:01.1575169Z scale_ub: Optional[float], 2025-05-07T20:32:01.1575436Z contiguous: bool, 2025-05-07T20:32:01.1575677Z compiled: bool, 2025-05-07T20:32:01.1575908Z ) -> None: 2025-05-07T20:32:01.1576119Z torch.manual_seed(2025) 2025-05-07T20:32:01.1576364Z 2025-05-07T20:32:01.1576642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1578680Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
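By this point free memory is down to roughly 26 MiB, so every new example dies at its first allocation and the log repeats the same trace with different (T, D). Instrumentation along these lines (torch.cuda.memory_allocated and torch.cuda.memory_reserved are real APIs; the placement is hypothetical) would confirm whether allocated memory ratchets upward across examples:

    import torch

    def log_cuda_mem(tag: str) -> None:
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")

    # e.g. log_cuda_mem(f"T={T} D={D}") at the top of test_silu_mul_quant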
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1580526Z 2025-05-07T20:32:01.1580671Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1580883Z 2025-05-07T20:32:01.1580985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1581402Z self=, 2025-05-07T20:32:01.1581804Z T=2048, 2025-05-07T20:32:01.1581989Z D=5120, 2025-05-07T20:32:01.1582183Z scale_ub=1200.0, 2025-05-07T20:32:01.1582409Z contiguous=False, 2025-05-07T20:32:01.1582638Z compiled=False, 2025-05-07T20:32:01.1582837Z ) 2025-05-07T20:32:01.1583158Z self = 2025-05-07T20:32:01.1583659Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.1583975Z 2025-05-07T20:32:01.1584063Z @given( 2025-05-07T20:32:01.1584297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.1584695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.1585002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.1585337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.1585668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.1585955Z ) 2025-05-07T20:32:01.1586304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.1586748Z def test_silu_mul_quant( 2025-05-07T20:32:01.1586988Z self, 2025-05-07T20:32:01.1587179Z T: int, 2025-05-07T20:32:01.1587379Z D: int, 2025-05-07T20:32:01.1587601Z scale_ub: Optional[float], 2025-05-07T20:32:01.1587868Z contiguous: bool, 2025-05-07T20:32:01.1588108Z compiled: bool, 2025-05-07T20:32:01.1588332Z ) -> None: 2025-05-07T20:32:01.1588544Z torch.manual_seed(2025) 2025-05-07T20:32:01.1588788Z 2025-05-07T20:32:01.1589061Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.1591156Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.1593007Z 2025-05-07T20:32:01.1593130Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.1593341Z 2025-05-07T20:32:01.1593445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.1593858Z self=, 2025-05-07T20:32:01.1594259Z T=4096, 2025-05-07T20:32:01.1594444Z D=7168, 2025-05-07T20:32:01.1594644Z scale_ub=1200.0, 2025-05-07T20:32:01.1594911Z contiguous=True, 2025-05-07T20:32:01.1595128Z compiled=False, 2025-05-07T20:32:01.1595336Z ) 2025-05-07T20:32:01.2518290Z self = 2025-05-07T20:32:01.2519042Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2519524Z 2025-05-07T20:32:01.2519653Z @given( 2025-05-07T20:32:01.2519950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2520267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2520576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2520906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2521230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2521523Z ) 2025-05-07T20:32:01.2521877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2522318Z def test_silu_mul_quant( 2025-05-07T20:32:01.2522607Z self, 2025-05-07T20:32:01.2522809Z T: int, 2025-05-07T20:32:01.2523004Z D: int, 2025-05-07T20:32:01.2523335Z scale_ub: Optional[float], 2025-05-07T20:32:01.2523616Z contiguous: bool, 2025-05-07T20:32:01.2523895Z compiled: bool, 2025-05-07T20:32:01.2524122Z ) -> None: 2025-05-07T20:32:01.2524344Z torch.manual_seed(2025) 2025-05-07T20:32:01.2524585Z 2025-05-07T20:32:01.2524862Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2527243Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2529127Z 2025-05-07T20:32:01.2529248Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2529459Z 2025-05-07T20:32:01.2529569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2529977Z self=, 2025-05-07T20:32:01.2530386Z T=16384, 2025-05-07T20:32:01.2530584Z D=7168, 2025-05-07T20:32:01.2530772Z scale_ub=None, 2025-05-07T20:32:01.2530988Z contiguous=False, 2025-05-07T20:32:01.2531215Z compiled=True, 2025-05-07T20:32:01.2531415Z ) 2025-05-07T20:32:01.2531736Z self = 2025-05-07T20:32:01.2532233Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.2532510Z 2025-05-07T20:32:01.2532594Z @given( 2025-05-07T20:32:01.2532824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2533231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2533537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2533860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2534186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2534476Z ) 2025-05-07T20:32:01.2534819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2535263Z def test_silu_mul_quant( 2025-05-07T20:32:01.2535504Z self, 2025-05-07T20:32:01.2535702Z T: int, 2025-05-07T20:32:01.2535897Z D: int, 2025-05-07T20:32:01.2536117Z scale_ub: Optional[float], 2025-05-07T20:32:01.2536389Z contiguous: bool, 2025-05-07T20:32:01.2536627Z compiled: bool, 2025-05-07T20:32:01.2536854Z ) -> None: 2025-05-07T20:32:01.2537075Z torch.manual_seed(2025) 2025-05-07T20:32:01.2537315Z 2025-05-07T20:32:01.2537597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2540014Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
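The failing line number also drifts as free memory shrinks. Matching the carets in these tracebacks against the test body (inferred from this log, not from the source file):

    # moe/activation_test.py:92 -> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # moe/activation_test.py:94 -> x_sign = torch.sign(x)
    # moe/activation_test.py:95 -> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)

Earlier examples got past randn and failed on the temporaries; by now even the initial randn fails.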
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2541867Z 2025-05-07T20:32:01.2541996Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2542209Z 2025-05-07T20:32:01.2542322Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2542728Z self=, 2025-05-07T20:32:01.2543141Z T=4096, 2025-05-07T20:32:01.2543337Z D=7168, 2025-05-07T20:32:01.2543522Z scale_ub=None, 2025-05-07T20:32:01.2543740Z contiguous=True, 2025-05-07T20:32:01.2543970Z compiled=False, 2025-05-07T20:32:01.2544167Z ) 2025-05-07T20:32:01.2544487Z self = 2025-05-07T20:32:01.2544984Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.2545253Z 2025-05-07T20:32:01.2545333Z @given( 2025-05-07T20:32:01.2545563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2545879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2546185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2546507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2546835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2547121Z ) 2025-05-07T20:32:01.2547597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2548054Z def test_silu_mul_quant( 2025-05-07T20:32:01.2548302Z self, 2025-05-07T20:32:01.2548495Z T: int, 2025-05-07T20:32:01.2548699Z D: int, 2025-05-07T20:32:01.2548920Z scale_ub: Optional[float], 2025-05-07T20:32:01.2549193Z contiguous: bool, 2025-05-07T20:32:01.2549433Z compiled: bool, 2025-05-07T20:32:01.2549658Z ) -> None: 2025-05-07T20:32:01.2549873Z torch.manual_seed(2025) 2025-05-07T20:32:01.2550116Z 2025-05-07T20:32:01.2550394Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2552440Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2554405Z 2025-05-07T20:32:01.2554529Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2554741Z 2025-05-07T20:32:01.2554847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2555262Z self=, 2025-05-07T20:32:01.2555670Z T=16384, 2025-05-07T20:32:01.2555860Z D=7168, 2025-05-07T20:32:01.2556060Z scale_ub=None, 2025-05-07T20:32:01.2556279Z contiguous=True, 2025-05-07T20:32:01.2556502Z compiled=False, 2025-05-07T20:32:01.2556709Z ) 2025-05-07T20:32:01.2557032Z self = 2025-05-07T20:32:01.2557529Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.2557808Z 2025-05-07T20:32:01.2557961Z @given( 2025-05-07T20:32:01.2558195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2558511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2558813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2559143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2559491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2559773Z ) 2025-05-07T20:32:01.2560128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2560575Z def test_silu_mul_quant( 2025-05-07T20:32:01.2560821Z self, 2025-05-07T20:32:01.2561015Z T: int, 2025-05-07T20:32:01.2561221Z D: int, 2025-05-07T20:32:01.2561445Z scale_ub: Optional[float], 2025-05-07T20:32:01.2561716Z contiguous: bool, 2025-05-07T20:32:01.2561960Z compiled: bool, 2025-05-07T20:32:01.2562196Z ) -> None: 2025-05-07T20:32:01.2562419Z torch.manual_seed(2025) 2025-05-07T20:32:01.2562673Z 2025-05-07T20:32:01.2562953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2565158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2567016Z 2025-05-07T20:32:01.2567136Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2567357Z 2025-05-07T20:32:01.2567463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2568007Z self=, 2025-05-07T20:32:01.2568428Z T=16384, 2025-05-07T20:32:01.2568620Z D=7168, 2025-05-07T20:32:01.2568818Z scale_ub=1200.0, 2025-05-07T20:32:01.2580374Z contiguous=True, 2025-05-07T20:32:01.2580646Z compiled=False, 2025-05-07T20:32:01.2580843Z ) 2025-05-07T20:32:01.2581152Z self = 2025-05-07T20:32:01.2581645Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2581925Z 2025-05-07T20:32:01.2582004Z @given( 2025-05-07T20:32:01.2582226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2582529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2582824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2583141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2583453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2583815Z ) 2025-05-07T20:32:01.2584162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2584589Z def test_silu_mul_quant( 2025-05-07T20:32:01.2584819Z self, 2025-05-07T20:32:01.2585005Z T: int, 2025-05-07T20:32:01.2585190Z D: int, 2025-05-07T20:32:01.2585400Z scale_ub: Optional[float], 2025-05-07T20:32:01.2585661Z contiguous: bool, 2025-05-07T20:32:01.2585887Z compiled: bool, 2025-05-07T20:32:01.2586100Z ) -> None: 2025-05-07T20:32:01.2586316Z torch.manual_seed(2025) 2025-05-07T20:32:01.2586555Z 2025-05-07T20:32:01.2586834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2588890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf52b7e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
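This CompilationError is independent of the out-of-memory failures: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this GPU and offers only fp8e4b15 and fp8e5. In recent Triton releases fp8e4nv generally requires an NVIDIA GPU of compute capability 8.9 or newer, so on older parts the realistic options are to skip or fall back. A minimal sketch of a capability gate, assuming the (8, 9) threshold holds for the Triton version in use:

    # Hedged sketch: skip FP8 E4M3 tests on GPUs where Triton cannot compile them.
    # The (8, 9) threshold is an assumption based on fp8e4nv commonly requiring
    # sm_89+ (Ada/Hopper); verify it against the installed Triton.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical container for the fp8 cases
        ...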
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (test source identical to the listing above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames identical to those shown above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    (test source as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
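Note the trend across examples: free memory has fallen from 26.44 MiB to 4.44 MiB, and now even a 20 MiB allocation three lines into the test fails, so tensors from earlier failed examples are apparently still holding the pool. One possible mitigation, sketched under the assumption that the pressure comes from ordinary live and cached tensors rather than a leak inside the kernels (hypothetical helper, not from the test file):

    # Hedged sketch: release CUDA memory between examples.
    import gc

    import torch


    def free_cuda_memory() -> None:
        gc.collect()              # drop Python references left over from a failed example
        torch.cuda.synchronize()  # let pending kernels finish first
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver

Because Hypothesis runs many examples inside a single test invocation, unittest's tearDown only fires after all of them; a helper like this would have to be called at the end of the test body (e.g. in a finally block) to take effect per example.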
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

    (test source as above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
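Every OOM message above ends with the allocator's own hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That option is documented on the Memory Management page the log links to; whether it rescues this job depends on how much of the problem is fragmentation rather than genuinely exhausted memory. A minimal sketch of wiring it in (the variable must be set before CUDA is first initialized):

    # Hedged sketch: opt in to expandable segments before the first CUDA allocation.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported only after the allocator config is in place

    x = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments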
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp>
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |     ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    (test source as above, continuing past fn())
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and Triton compile frames as in sub-exception 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
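Hypothesis prints a ready-made replay handle for each distinct failure above. A schematic of how the decorator is meant to be used (the blob must match the strategies of the test it decorates, so it goes on test_silu_mul_quant itself, with the body unchanged, and is removed after debugging):

    # Hedged sketch: replay sub-exception 4 exactly, per Hypothesis's own suggestion.
    from hypothesis import given, reproduce_failure
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob copied from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # existing test body, unchanged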
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above, continuing past fn())
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and Triton compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    (test source as above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT/compile frames as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8856202Z 2025-05-07T20:32:01.8856629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.8857148Z 2025-05-07T20:32:01.8857262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.8857671Z self=, 2025-05-07T20:32:01.8858077Z T=1, 2025-05-07T20:32:01.8858267Z D=7168, 2025-05-07T20:32:01.8858452Z scale_ub=None, 2025-05-07T20:32:01.8858664Z contiguous=True, 2025-05-07T20:32:01.8858887Z compiled=True, 2025-05-07T20:32:01.8859084Z ) 2025-05-07T20:32:01.8859402Z self = 2025-05-07T20:32:01.8859884Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.8860289Z 2025-05-07T20:32:01.8860375Z @given( 2025-05-07T20:32:01.8860606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8860918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8861217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.8861547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.8861875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.8862163Z ) 2025-05-07T20:32:01.8862505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.8862947Z def test_silu_mul_quant( 2025-05-07T20:32:01.8863185Z self, 2025-05-07T20:32:01.8863373Z T: int, 2025-05-07T20:32:01.8863572Z D: int, 2025-05-07T20:32:01.8863789Z scale_ub: Optional[float], 2025-05-07T20:32:01.8864053Z contiguous: bool, 2025-05-07T20:32:01.8864292Z compiled: bool, 2025-05-07T20:32:01.8864512Z ) -> None: 2025-05-07T20:32:01.8864777Z torch.manual_seed(2025) 2025-05-07T20:32:01.8865017Z 2025-05-07T20:32:01.8865297Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.8865635Z 2025-05-07T20:32:01.8865829Z x_sign = torch.sign(x) 2025-05-07T20:32:01.8866118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.8866428Z x = x_sign * x_clamp 2025-05-07T20:32:01.8866660Z x0 = x[:, :D] 2025-05-07T20:32:01.8866876Z x1 = x[:, D:] 2025-05-07T20:32:01.8867084Z 2025-05-07T20:32:01.8867264Z if contiguous: 2025-05-07T20:32:01.8867497Z x0 = x0.contiguous() 2025-05-07T20:32:01.8867758Z x1 = x1.contiguous() 2025-05-07T20:32:01.8867991Z 2025-05-07T20:32:01.8868183Z if scale_ub is not None: 2025-05-07T20:32:01.8868459Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.8868789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.8869101Z ) 2025-05-07T20:32:01.8869347Z else: 2025-05-07T20:32:01.8869548Z scale_ub_tensor = None 2025-05-07T20:32:01.8869795Z 2025-05-07T20:32:01.8870032Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8870338Z op = silu_mul_quant 2025-05-07T20:32:01.8870589Z if compiled: 2025-05-07T20:32:01.8870839Z op = torch.compile(op) 2025-05-07T20:32:01.8871365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8871637Z 2025-05-07T20:32:01.8871828Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.8872111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.8872396Z 2025-05-07T20:32:01.8872631Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8872967Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.8873253Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.8873576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.8873951Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8874255Z 2025-05-07T20:32:01.8874456Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.8874650Z 2025-05-07T20:32:01.8874757Z moe/activation_test.py:126: 2025-05-07T20:32:01.8875044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8875378Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.8875704Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8876495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.8877257Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.8877810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8878589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.8879286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.8880014Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.8880773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.8881530Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.8882264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.8883043Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.8883934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.8884476Z fn() 2025-05-07T20:32:01.8885088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.8885745Z self.fn.run( 2025-05-07T20:32:01.8886220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.8886751Z kernel = self.compile( 2025-05-07T20:32:01.8887300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.8887968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.8888365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8888593Z 2025-05-07T20:32:01.8888802Z self = 2025-05-07T20:32:01.8889885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.8891309Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd397055440>} 2025-05-07T20:32:01.8892652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.8893675Z context = 2025-05-07T20:32:01.8894136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.8894658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.8894770Z module_map=module_map) 2025-05-07T20:32:01.8894943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.8895050Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.8895126Z E ^ 2025-05-07T20:32:01.8895490Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8895913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining Hypothesis example fails with this same CompilationError, differing only in the generated parameters and in which Triton kernel is compiled first: _fbgemm_silu_mul_quant when fn() calls the FBGEMM silu_mul_quant op, or _kernel_quantize_fp8_row when ref_fn() reaches triton_quantize_fp8_row. The next four examples:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
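The repeated ValueError has a single root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, and on this Triton build the NVIDIA backend only accepts it on GPUs of compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G at compute capability 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the message reports. A minimal probe for this condition, assuming only the standard torch CUDA API (the helper name is hypothetical, not an FBGEMM function):

```python
import torch

def gpu_supports_fp8e4nv() -> bool:
    """Best-effort check: can Triton's CUDA backend emit fp8e4nv here?

    Assumption: fp8e4nv (float8_e4m3fn) needs compute capability >= (8, 9);
    the A10G in a g5.4xlarge reports (8, 6).
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

On this runner the probe would return False, which is consistent with every generated example failing the same way.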
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
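Rather than letting every generated example die inside the Triton compiler, the test could be skipped up front on such GPUs. A sketch of one way to wire that in, reusing the probe above; the class and method names are hypothetical, and the hypothesis decorators from the log are elided for brevity:

```python
import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv requires compute capability >= (8, 9) on this build.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTestGuarded(unittest.TestCase):
    @unittest.skipUnless(
        gpu_supports_fp8e4nv(),
        "Triton on this GPU only supports ('fp8e4b15', 'fp8e5'), not fp8e4nv",
    )
    def test_silu_mul_quant_guarded(self) -> None:
        # On the real test, the @given/@settings stack would sit between
        # skipUnless (outermost) and the function definition.
        pass
```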
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
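For the reference path specifically, the row-wise quantization that ref_fn() requests from triton_quantize_fp8_row can be approximated in plain PyTorch, which avoids Triton codegen entirely. A rough sketch, not the FBGEMM implementation: it assumes the usual float8_e4m3fn max-normal value of 448, an arbitrary epsilon, and the dequantization convention the test uses (y is approximately y_fp8.to(torch.float32) * y_scale[:, None]):

```python
from typing import Optional, Tuple

import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scale so each row fits the float8_e4m3fn range.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Cap the per-row max, mirroring scale_ub_tensor in the test.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    y_scale = row_max.clamp(min=1e-12) / 448.0  # 448 = e4m3fn max normal
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

This needs a PyTorch build that exposes torch.float8_e4m3fn (2.1 or later), and its numerics will differ slightly from the Triton kernel's.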
2025-05-07T20:32:01.8996621Z op = torch.compile(op) 2025-05-07T20:32:01.8996729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8996801Z 2025-05-07T20:32:01.8996898Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.8997019Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.8997092Z 2025-05-07T20:32:01.8997239Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8997341Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.8997448Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.8997570Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.8997709Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8997790Z 2025-05-07T20:32:01.8997889Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.8997894Z 2025-05-07T20:32:01.8997992Z moe/activation_test.py:126: 2025-05-07T20:32:01.8998135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8998286Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.8998427Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.8998993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.8999095Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.8999466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8999690Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9000063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9000326Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9000738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9001041Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9001420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9001589Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9001941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9002018Z fn() 2025-05-07T20:32:01.9002432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9002515Z self.fn.run( 2025-05-07T20:32:01.9002857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9002962Z kernel = self.compile( 2025-05-07T20:32:01.9003516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9003693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9003826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9003831Z 2025-05-07T20:32:01.9004038Z self = 2025-05-07T20:32:01.9004818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:01.9005320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3969c2f20>} 2025-05-07T20:32:01.9006162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9006362Z context = 2025-05-07T20:32:01.9006367Z 2025-05-07T20:32:01.9006533Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9006805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9006919Z module_map=module_map) 2025-05-07T20:32:01.9007089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9007190Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9007266Z E ^ 2025-05-07T20:32:01.9007628Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9007633Z 2025-05-07T20:32:01.9008097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9008103Z 2025-05-07T20:32:01.9008210Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9008439Z self=, 2025-05-07T20:32:01.9008517Z T=128, 2025-05-07T20:32:01.9008595Z D=5120, 2025-05-07T20:32:01.9008676Z scale_ub=None, 2025-05-07T20:32:01.9008760Z contiguous=True, 2025-05-07T20:32:01.9008851Z compiled=True, 2025-05-07T20:32:01.9008919Z ) 2025-05-07T20:32:01.9009143Z self = 2025-05-07T20:32:01.9009318Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9009324Z 2025-05-07T20:32:01.9009401Z @given( 2025-05-07T20:32:01.9009519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9009622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9009879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9010004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9010117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9010194Z ) 2025-05-07T20:32:01.9010444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9010536Z def test_silu_mul_quant( 2025-05-07T20:32:01.9010613Z self, 2025-05-07T20:32:01.9010696Z T: int, 2025-05-07T20:32:01.9010771Z D: int, 2025-05-07T20:32:01.9010869Z scale_ub: Optional[float], 2025-05-07T20:32:01.9010966Z contiguous: bool, 2025-05-07T20:32:01.9011053Z compiled: bool, 2025-05-07T20:32:01.9011130Z ) -> None: 2025-05-07T20:32:01.9011233Z torch.manual_seed(2025) 2025-05-07T20:32:01.9011307Z 2025-05-07T20:32:01.9011488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9011564Z 2025-05-07T20:32:01.9011663Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9011793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9011882Z x = x_sign * x_clamp 2025-05-07T20:32:01.9011962Z x0 = x[:, :D] 2025-05-07T20:32:01.9012048Z x1 = x[:, D:] 2025-05-07T20:32:01.9012120Z 2025-05-07T20:32:01.9012202Z if contiguous: 2025-05-07T20:32:01.9012301Z x0 = x0.contiguous() 2025-05-07T20:32:01.9012389Z x1 = x1.contiguous() 2025-05-07T20:32:01.9012461Z 2025-05-07T20:32:01.9012561Z if scale_ub is not None: 2025-05-07T20:32:01.9012668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9012811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9012884Z ) 2025-05-07T20:32:01.9012960Z else: 2025-05-07T20:32:01.9013060Z scale_ub_tensor = None 2025-05-07T20:32:01.9013133Z 2025-05-07T20:32:01.9013344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:01.9013450Z op = silu_mul_quant 2025-05-07T20:32:01.9013536Z if compiled: 2025-05-07T20:32:01.9013641Z op = torch.compile(op) 2025-05-07T20:32:01.9013754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9013827Z 2025-05-07T20:32:01.9013918Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9014045Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9014119Z 2025-05-07T20:32:01.9014255Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9014362Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9014460Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9014587Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9014727Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9014799Z 2025-05-07T20:32:01.9014909Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.9014966Z 2025-05-07T20:32:01.9015069Z moe/activation_test.py:126: 2025-05-07T20:32:01.9015199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9015311Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9015443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9016018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9016117Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9016486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9016716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9017088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9017356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9017807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9018062Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9018448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9018615Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9018960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9019045Z fn() 2025-05-07T20:32:01.9019449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9019535Z self.fn.run( 2025-05-07T20:32:01.9019884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9019981Z kernel = self.compile( 2025-05-07T20:32:01.9020374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9020552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9020686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9020691Z 2025-05-07T20:32:01.9020905Z self = 2025-05-07T20:32:01.9021683Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd396163c40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd301b49760>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
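Every failure in this run reduces to the one ValueError above: fp8e4nv is Triton's name for the float8_e4m3fn encoding, and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). A device whose backend offers only 'fp8e4b15' and 'fp8e5' is a pre-8.9 part, so both fp8 kernels in this test are unbuildable here regardless of the drawn parameters. A minimal sketch to confirm what the device reports (the values in the comments are what an SM 8.6 part such as an A10G would print, not values taken from this log):

import torch

# Report the GPU the tests ran on. Triton refuses fp8e4nv below (8, 9),
# which matches the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message.
print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA A10G"
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6)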
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
This example fails identically to the one above: ref_fn() reaches _kernel_quantize_fp8_row via triton_quantize_fp8_row, and compilation raises the same CompilationError from moe/activation_test.py:126.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

This example fails one step earlier, at the call under test rather than in the reference path:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd301c96ac0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
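Both kernels fail for the same reason: _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant each ask Triton to emit an e4m3 (fp8e4nv) conversion. Where only fp8e5 (e5m2) is available, one workaround pattern is to choose the fp8 dtype at runtime; a minimal sketch, where pick_fp8_dtype is a hypothetical helper and not anything FBGEMM ships:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # torch.float8_e4m3fn lowers to Triton's fp8e4nv, which needs SM 8.9+;
    # torch.float8_e5m2 lowers to fp8e5, which this log lists as supported.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2

Note that e5m2 trades a mantissa bit for range, so a test comparing against an e4m3 reference would also need looser tolerances.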
Hypothesis keeps drawing examples, and each fails with the same CompilationError; only the drawn parameters and the first kernel to hit the Triton compiler differ:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant
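Every drawn example, whatever the values of T, D, scale_ub, contiguous, or compiled, dies in the same place, which marks this as an environment mismatch rather than a property of the inputs. A capability-based skip would turn these cases into clean skips on such runners; a sketch of one possible guard (hypothetical; nothing in moe/activation_test.py is shown configuring this):

import pytest
import torch

# Skip fp8 tests on GPUs whose Triton backend cannot compile fp8e4nv (e4m3).
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv requires compute capability 8.9+ (Ada/Hopper)",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, the guard would report these cases as skipped instead of failing the job.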
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9122159Z 2025-05-07T20:32:01.9122585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9122633Z 2025-05-07T20:32:01.9122747Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9122972Z self=, 2025-05-07T20:32:01.9123056Z T=128, 2025-05-07T20:32:01.9123133Z D=7168, 2025-05-07T20:32:01.9123320Z scale_ub=1200.0, 2025-05-07T20:32:01.9123406Z contiguous=False, 2025-05-07T20:32:01.9123486Z compiled=False, 2025-05-07T20:32:01.9123569Z ) 2025-05-07T20:32:01.9123787Z self = 2025-05-07T20:32:01.9123960Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9123965Z 2025-05-07T20:32:01.9124046Z @given( 2025-05-07T20:32:01.9124166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9124263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9124386Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9124502Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9124682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9124757Z ) 2025-05-07T20:32:01.9125001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9125101Z def test_silu_mul_quant( 2025-05-07T20:32:01.9125176Z self, 2025-05-07T20:32:01.9125249Z T: int, 2025-05-07T20:32:01.9125331Z D: int, 2025-05-07T20:32:01.9125428Z scale_ub: Optional[float], 2025-05-07T20:32:01.9125516Z contiguous: bool, 2025-05-07T20:32:01.9125608Z compiled: bool, 2025-05-07T20:32:01.9125687Z ) -> None: 2025-05-07T20:32:01.9125778Z torch.manual_seed(2025) 2025-05-07T20:32:01.9125856Z 2025-05-07T20:32:01.9126026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9126104Z 2025-05-07T20:32:01.9126195Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9126321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9126428Z x = x_sign * x_clamp 2025-05-07T20:32:01.9126512Z x0 = x[:, :D] 2025-05-07T20:32:01.9126591Z x1 = x[:, D:] 2025-05-07T20:32:01.9126672Z 2025-05-07T20:32:01.9126755Z if contiguous: 2025-05-07T20:32:01.9126847Z x0 = x0.contiguous() 2025-05-07T20:32:01.9126942Z x1 = x1.contiguous() 2025-05-07T20:32:01.9127012Z 2025-05-07T20:32:01.9127102Z if scale_ub is not None: 2025-05-07T20:32:01.9127213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9127347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9127429Z ) 2025-05-07T20:32:01.9127503Z else: 2025-05-07T20:32:01.9127599Z scale_ub_tensor = None 2025-05-07T20:32:01.9127676Z 2025-05-07T20:32:01.9127806Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9127897Z op = silu_mul_quant 2025-05-07T20:32:01.9127990Z if compiled: 2025-05-07T20:32:01.9128198Z op = torch.compile(op) 2025-05-07T20:32:01.9128313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9128395Z 2025-05-07T20:32:01.9128487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9128492Z 2025-05-07T20:32:01.9128593Z moe/activation_test.py:117: 2025-05-07T20:32:01.9128729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9128829Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9128933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9129438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9129536Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9129911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9130144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9130533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9130635Z kernel = self.compile( 2025-05-07T20:32:01.9131023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9131207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9131333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9131338Z 2025-05-07T20:32:01.9131548Z self = 2025-05-07T20:32:01.9132336Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9132842Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3009f23e0>} 2025-05-07T20:32:01.9133649Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9133842Z context = 2025-05-07T20:32:01.9133846Z 2025-05-07T20:32:01.9134018Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9134282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9134389Z module_map=module_map) 2025-05-07T20:32:01.9134559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9134656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9134731Z E ^ 2025-05-07T20:32:01.9135103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9135111Z 2025-05-07T20:32:01.9135529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9135534Z 2025-05-07T20:32:01.9135644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9135865Z self=, 2025-05-07T20:32:01.9135945Z T=128, 2025-05-07T20:32:01.9136024Z D=5120, 2025-05-07T20:32:01.9136108Z scale_ub=None, 2025-05-07T20:32:01.9136194Z contiguous=False, 2025-05-07T20:32:01.9136283Z compiled=False, 2025-05-07T20:32:01.9136359Z ) 2025-05-07T20:32:01.9136574Z self = 2025-05-07T20:32:01.9136753Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9136760Z 2025-05-07T20:32:01.9136919Z @given( 2025-05-07T20:32:01.9137046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9137145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9137257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9137383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9137494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9137569Z ) 2025-05-07T20:32:01.9137817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9137910Z def test_silu_mul_quant( 2025-05-07T20:32:01.9137992Z self, 2025-05-07T20:32:01.9138069Z T: int, 2025-05-07T20:32:01.9138145Z D: int, 2025-05-07T20:32:01.9138246Z scale_ub: Optional[float], 2025-05-07T20:32:01.9138335Z contiguous: bool, 2025-05-07T20:32:01.9138673Z compiled: bool, 2025-05-07T20:32:01.9138800Z ) -> None: 2025-05-07T20:32:01.9138942Z torch.manual_seed(2025) 2025-05-07T20:32:01.9139167Z 2025-05-07T20:32:01.9139348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9139421Z 2025-05-07T20:32:01.9139510Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9139644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9139732Z x = x_sign * x_clamp 2025-05-07T20:32:01.9139813Z x0 = x[:, :D] 2025-05-07T20:32:01.9139902Z x1 = x[:, D:] 2025-05-07T20:32:01.9139972Z 2025-05-07T20:32:01.9140060Z if contiguous: 2025-05-07T20:32:01.9140149Z x0 = x0.contiguous() 2025-05-07T20:32:01.9140235Z x1 = x1.contiguous() 2025-05-07T20:32:01.9140313Z 2025-05-07T20:32:01.9140404Z if scale_ub is not None: 2025-05-07T20:32:01.9140513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9140650Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9140726Z ) 2025-05-07T20:32:01.9140812Z else: 2025-05-07T20:32:01.9140993Z scale_ub_tensor = None 2025-05-07T20:32:01.9141061Z 2025-05-07T20:32:01.9141190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9141290Z op = silu_mul_quant 2025-05-07T20:32:01.9141374Z if compiled: 2025-05-07T20:32:01.9141478Z op = torch.compile(op) 2025-05-07T20:32:01.9141587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9141659Z 2025-05-07T20:32:01.9141756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9141761Z 2025-05-07T20:32:01.9141860Z moe/activation_test.py:117: 2025-05-07T20:32:01.9141989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9142093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9142193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9142699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9142807Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9143171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9143401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9143743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9143838Z kernel = self.compile( 2025-05-07T20:32:01.9144235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9144413Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9144547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9144552Z 2025-05-07T20:32:01.9144756Z self = 2025-05-07T20:32:01.9145669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9146188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e77e0>} 2025-05-07T20:32:01.9146942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9147139Z context = 2025-05-07T20:32:01.9147144Z 2025-05-07T20:32:01.9147310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9147579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9147833Z module_map=module_map) 2025-05-07T20:32:01.9147994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9148100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9148179Z E ^ 2025-05-07T20:32:01.9148536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9148541Z 2025-05-07T20:32:01.9148967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9148972Z 2025-05-07T20:32:01.9149076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9149304Z self=, 2025-05-07T20:32:01.9149384Z T=128, 2025-05-07T20:32:01.9149464Z D=5120, 2025-05-07T20:32:01.9149555Z scale_ub=1200.0, 2025-05-07T20:32:01.9149649Z contiguous=True, 2025-05-07T20:32:01.9149821Z compiled=False, 2025-05-07T20:32:01.9149901Z ) 2025-05-07T20:32:01.9150119Z self = 2025-05-07T20:32:01.9150290Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9150295Z 2025-05-07T20:32:01.9150378Z @given( 2025-05-07T20:32:01.9150497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9150601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9150716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9150833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9150952Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9151026Z ) 2025-05-07T20:32:01.9151270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9151372Z def test_silu_mul_quant( 2025-05-07T20:32:01.9151448Z self, 2025-05-07T20:32:01.9151536Z T: int, 2025-05-07T20:32:01.9151621Z D: int, 2025-05-07T20:32:01.9151719Z scale_ub: Optional[float], 2025-05-07T20:32:01.9151810Z contiguous: bool, 2025-05-07T20:32:01.9151904Z compiled: bool, 2025-05-07T20:32:01.9151981Z ) -> None: 2025-05-07T20:32:01.9152081Z torch.manual_seed(2025) 2025-05-07T20:32:01.9152156Z 2025-05-07T20:32:01.9152331Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9152406Z 2025-05-07T20:32:01.9152515Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9152639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9152729Z x = x_sign * x_clamp 2025-05-07T20:32:01.9152815Z x0 = x[:, :D] 2025-05-07T20:32:01.9152895Z x1 = x[:, D:] 2025-05-07T20:32:01.9152978Z 2025-05-07T20:32:01.9153062Z if contiguous: 2025-05-07T20:32:01.9153152Z x0 = x0.contiguous() 2025-05-07T20:32:01.9153331Z x1 = x1.contiguous() 2025-05-07T20:32:01.9153408Z 2025-05-07T20:32:01.9153499Z if scale_ub is not None: 2025-05-07T20:32:01.9153614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9153750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9153829Z ) 2025-05-07T20:32:01.9153914Z else: 2025-05-07T20:32:01.9154008Z scale_ub_tensor = None 2025-05-07T20:32:01.9154084Z 2025-05-07T20:32:01.9154222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9154312Z op = silu_mul_quant 2025-05-07T20:32:01.9154398Z if compiled: 2025-05-07T20:32:01.9154504Z op = torch.compile(op) 2025-05-07T20:32:01.9154610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9154691Z 2025-05-07T20:32:01.9154782Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9154786Z 2025-05-07T20:32:01.9154883Z moe/activation_test.py:117: 2025-05-07T20:32:01.9155067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9155169Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9155270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9155784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9155880Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9156252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9156480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9156824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9156923Z kernel = self.compile( 2025-05-07T20:32:01.9157315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9157536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9157670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9157674Z 2025-05-07T20:32:01.9157879Z self = 2025-05-07T20:32:01.9158666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9159170Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e51c0>} 2025-05-07T20:32:01.9159940Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9160135Z context = 2025-05-07T20:32:01.9160139Z 2025-05-07T20:32:01.9160306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9160580Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9160685Z module_map=module_map) 2025-05-07T20:32:01.9160857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9160954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9161032Z E ^ 2025-05-07T20:32:01.9161398Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9161402Z 2025-05-07T20:32:01.9161820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9161929Z 2025-05-07T20:32:01.9162035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9162264Z self=, 2025-05-07T20:32:01.9162341Z T=1, 2025-05-07T20:32:01.9162426Z D=7168, 2025-05-07T20:32:01.9162510Z scale_ub=1200.0, 2025-05-07T20:32:01.9162593Z contiguous=True, 2025-05-07T20:32:01.9162685Z compiled=True, 2025-05-07T20:32:01.9162759Z ) 2025-05-07T20:32:01.9162977Z self = 2025-05-07T20:32:01.9163150Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9163155Z 2025-05-07T20:32:01.9163338Z @given( 2025-05-07T20:32:01.9163457Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9163561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9163676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9163811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9163972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9164046Z ) 2025-05-07T20:32:01.9164297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9164392Z def test_silu_mul_quant( 2025-05-07T20:32:01.9164471Z self, 2025-05-07T20:32:01.9164552Z T: int, 2025-05-07T20:32:01.9164630Z D: int, 2025-05-07T20:32:01.9164727Z scale_ub: Optional[float], 2025-05-07T20:32:01.9164822Z contiguous: bool, 2025-05-07T20:32:01.9164910Z compiled: bool, 2025-05-07T20:32:01.9164989Z ) -> None: 2025-05-07T20:32:01.9165091Z torch.manual_seed(2025) 2025-05-07T20:32:01.9165168Z 2025-05-07T20:32:01.9165347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9165419Z 2025-05-07T20:32:01.9165511Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9165642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9165785Z x = x_sign * x_clamp 2025-05-07T20:32:01.9165867Z x0 = x[:, :D] 2025-05-07T20:32:01.9165957Z x1 = x[:, D:] 2025-05-07T20:32:01.9166029Z 2025-05-07T20:32:01.9166111Z if contiguous: 2025-05-07T20:32:01.9166215Z x0 = x0.contiguous() 2025-05-07T20:32:01.9166305Z x1 = x1.contiguous() 2025-05-07T20:32:01.9166379Z 2025-05-07T20:32:01.9166478Z if scale_ub is not None: 2025-05-07T20:32:01.9166584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9166724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9166801Z ) 2025-05-07T20:32:01.9166881Z else: 2025-05-07T20:32:01.9166983Z scale_ub_tensor = None 2025-05-07T20:32:01.9167057Z 2025-05-07T20:32:01.9167190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9167286Z op = silu_mul_quant 2025-05-07T20:32:01.9167373Z if compiled: 2025-05-07T20:32:01.9167479Z op = torch.compile(op) 2025-05-07T20:32:01.9167597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9167672Z 2025-05-07T20:32:01.9167766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9167770Z 2025-05-07T20:32:01.9167878Z moe/activation_test.py:117: 2025-05-07T20:32:01.9168009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9168116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9168213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9168588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9168689Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9169192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9169288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9169745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9169973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9170325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9170419Z kernel = self.compile( 2025-05-07T20:32:01.9170810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9170995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9171122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9171127Z 2025-05-07T20:32:01.9171338Z self = 2025-05-07T20:32:01.9172121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9172669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3011e67a0>} 2025-05-07T20:32:01.9173435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9173626Z context = 2025-05-07T20:32:01.9173630Z 2025-05-07T20:32:01.9173802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9174065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9174173Z module_map=module_map) 2025-05-07T20:32:01.9174399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9174495Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9174582Z E ^ 2025-05-07T20:32:01.9174940Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [same @given/@settings decorators and test body as the first example; with
    scale_ub=None the failure moves past fn() into the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
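The scale_ub=None example shows the intended numerics in full: ref_fn computes x0 * sigmoid(x0) * x1 in fp32 and then rowwise-quantizes the product to FP8. A plain-PyTorch sketch of that rowwise quantization, useful for checking the comparison without Triton; the helper names, eps guard, and clamping details are assumptions, not FBGEMM's triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch


    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        x0_fp32 = x0.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)


    def quantize_fp8_rowwise(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).clamp_min(eps)    # per-row absmax, shape [T]
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the row scale
        y_scale = row_max / fp8_max                     # dequantization scale
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The test then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], matching the return convention sketched here.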
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body and traceback as the first example; CompilationError in _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9326633Z 2025-05-07T20:32:01.9327044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9327049Z 2025-05-07T20:32:01.9327159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9327384Z self=, 2025-05-07T20:32:01.9327502Z T=4096, 2025-05-07T20:32:01.9327582Z D=5120, 2025-05-07T20:32:01.9327667Z scale_ub=1200.0, 2025-05-07T20:32:01.9327751Z contiguous=False, 2025-05-07T20:32:01.9327839Z compiled=False, 2025-05-07T20:32:01.9327910Z ) 2025-05-07T20:32:01.9328133Z self = 2025-05-07T20:32:01.9328308Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9328312Z 2025-05-07T20:32:01.9328389Z @given( 2025-05-07T20:32:01.9328513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9328608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9328721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9328846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9328962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9329036Z ) 2025-05-07T20:32:01.9329291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9329427Z def test_silu_mul_quant( 2025-05-07T20:32:01.9329510Z self, 2025-05-07T20:32:01.9329586Z T: int, 2025-05-07T20:32:01.9329663Z D: int, 2025-05-07T20:32:01.9329766Z scale_ub: Optional[float], 2025-05-07T20:32:01.9329854Z contiguous: bool, 2025-05-07T20:32:01.9329940Z compiled: bool, 2025-05-07T20:32:01.9330023Z ) -> None: 2025-05-07T20:32:01.9330115Z torch.manual_seed(2025) 2025-05-07T20:32:01.9330187Z 2025-05-07T20:32:01.9330365Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9330439Z 2025-05-07T20:32:01.9330533Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9330662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9330748Z x = x_sign * x_clamp 2025-05-07T20:32:01.9330829Z x0 = x[:, :D] 2025-05-07T20:32:01.9330907Z x1 = x[:, D:] 2025-05-07T20:32:01.9330992Z 2025-05-07T20:32:01.9331079Z if contiguous: 2025-05-07T20:32:01.9331173Z x0 = x0.contiguous() 2025-05-07T20:32:01.9331261Z x1 = x1.contiguous() 2025-05-07T20:32:01.9331338Z 2025-05-07T20:32:01.9331426Z if scale_ub is not None: 2025-05-07T20:32:01.9331529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9331669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9331741Z ) 2025-05-07T20:32:01.9331819Z else: 2025-05-07T20:32:01.9331917Z scale_ub_tensor = None 2025-05-07T20:32:01.9331988Z 2025-05-07T20:32:01.9332115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9332209Z op = silu_mul_quant 2025-05-07T20:32:01.9332294Z if compiled: 2025-05-07T20:32:01.9332397Z op = torch.compile(op) 2025-05-07T20:32:01.9332499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9332654Z 2025-05-07T20:32:01.9332753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9332757Z 2025-05-07T20:32:01.9332858Z moe/activation_test.py:117: 2025-05-07T20:32:01.9332986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9333086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9333185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9333694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9333788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9334147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9334375Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9334718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9334880Z kernel = self.compile( 2025-05-07T20:32:01.9335272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9335444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9335578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9335583Z 2025-05-07T20:32:01.9335785Z self = 2025-05-07T20:32:01.9336555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9337059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b8400>} 2025-05-07T20:32:01.9337860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9338052Z context = 2025-05-07T20:32:01.9338057Z 2025-05-07T20:32:01.9338221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9338685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9338847Z module_map=module_map) 2025-05-07T20:32:01.9339014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9339117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9339192Z E ^ 2025-05-07T20:32:01.9339555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9339576Z 2025-05-07T20:32:01.9339998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9340002Z 2025-05-07T20:32:01.9340103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9340328Z self=, 2025-05-07T20:32:01.9340404Z T=4096, 2025-05-07T20:32:01.9340479Z D=5120, 2025-05-07T20:32:01.9340570Z scale_ub=1200.0, 2025-05-07T20:32:01.9340651Z contiguous=False, 2025-05-07T20:32:01.9340729Z compiled=True, 2025-05-07T20:32:01.9340805Z ) 2025-05-07T20:32:01.9341020Z self = 2025-05-07T20:32:01.9341193Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.9341198Z 2025-05-07T20:32:01.9341278Z @given( 2025-05-07T20:32:01.9341396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9341732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9341847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9341961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9342082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9342155Z ) 2025-05-07T20:32:01.9342400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9342496Z def test_silu_mul_quant( 2025-05-07T20:32:01.9342568Z self, 2025-05-07T20:32:01.9342644Z T: int, 2025-05-07T20:32:01.9342726Z D: int, 2025-05-07T20:32:01.9342822Z scale_ub: Optional[float], 2025-05-07T20:32:01.9342908Z contiguous: bool, 2025-05-07T20:32:01.9342997Z compiled: bool, 2025-05-07T20:32:01.9343073Z ) -> None: 2025-05-07T20:32:01.9343171Z torch.manual_seed(2025) 2025-05-07T20:32:01.9343241Z 2025-05-07T20:32:01.9343415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9343557Z 2025-05-07T20:32:01.9343647Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9343773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9343864Z x = x_sign * x_clamp 2025-05-07T20:32:01.9343940Z x0 = x[:, :D] 2025-05-07T20:32:01.9344018Z x1 = x[:, D:] 2025-05-07T20:32:01.9344092Z 2025-05-07T20:32:01.9344175Z if contiguous: 2025-05-07T20:32:01.9344264Z x0 = x0.contiguous() 2025-05-07T20:32:01.9344355Z x1 = x1.contiguous() 2025-05-07T20:32:01.9344427Z 2025-05-07T20:32:01.9344520Z if scale_ub is not None: 2025-05-07T20:32:01.9344621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9344752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9344832Z ) 2025-05-07T20:32:01.9344903Z else: 2025-05-07T20:32:01.9344998Z scale_ub_tensor = None 2025-05-07T20:32:01.9345080Z 2025-05-07T20:32:01.9345215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9345370Z op = silu_mul_quant 2025-05-07T20:32:01.9345461Z if compiled: 2025-05-07T20:32:01.9345558Z op = torch.compile(op) 2025-05-07T20:32:01.9345664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9345741Z 2025-05-07T20:32:01.9345831Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9345836Z 2025-05-07T20:32:01.9345938Z moe/activation_test.py:117: 2025-05-07T20:32:01.9346066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9346164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9346265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9346634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9346724Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9347229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9347328Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9347690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9347910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9348251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9348346Z kernel = self.compile( 2025-05-07T20:32:01.9348730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9348901Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9349036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9349040Z 2025-05-07T20:32:01.9349333Z self = 2025-05-07T20:32:01.9350111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9350608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b9620>} 2025-05-07T20:32:01.9351365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9351551Z context = 2025-05-07T20:32:01.9351555Z 2025-05-07T20:32:01.9355567Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9355937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9356055Z module_map=module_map) 2025-05-07T20:32:01.9356217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9356315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9356394Z E ^ 2025-05-07T20:32:01.9356755Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9356761Z 2025-05-07T20:32:01.9357185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9357189Z 2025-05-07T20:32:01.9357289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9357512Z self=, 2025-05-07T20:32:01.9357595Z T=2048, 2025-05-07T20:32:01.9357674Z D=7168, 2025-05-07T20:32:01.9357762Z scale_ub=1200.0, 2025-05-07T20:32:01.9357899Z contiguous=False, 2025-05-07T20:32:01.9357980Z compiled=False, 2025-05-07T20:32:01.9358050Z ) 2025-05-07T20:32:01.9358271Z self = 2025-05-07T20:32:01.9358444Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9358449Z 2025-05-07T20:32:01.9358530Z @given( 2025-05-07T20:32:01.9358648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9358743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9358865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9358981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9359094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9359179Z ) 2025-05-07T20:32:01.9359422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9359521Z def test_silu_mul_quant( 2025-05-07T20:32:01.9359610Z self, 2025-05-07T20:32:01.9359689Z T: int, 2025-05-07T20:32:01.9359768Z D: int, 2025-05-07T20:32:01.9359865Z scale_ub: Optional[float], 2025-05-07T20:32:01.9359955Z contiguous: bool, 2025-05-07T20:32:01.9360044Z compiled: bool, 2025-05-07T20:32:01.9360122Z ) -> None: 2025-05-07T20:32:01.9360214Z torch.manual_seed(2025) 2025-05-07T20:32:01.9360291Z 2025-05-07T20:32:01.9360461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9360535Z 2025-05-07T20:32:01.9360635Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9360756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9360845Z x = x_sign * x_clamp 2025-05-07T20:32:01.9360925Z x0 = x[:, :D] 2025-05-07T20:32:01.9361005Z x1 = x[:, D:] 2025-05-07T20:32:01.9361082Z 2025-05-07T20:32:01.9361162Z if contiguous: 2025-05-07T20:32:01.9361337Z x0 = x0.contiguous() 2025-05-07T20:32:01.9361438Z x1 = x1.contiguous() 2025-05-07T20:32:01.9361509Z 2025-05-07T20:32:01.9361595Z if scale_ub is not None: 2025-05-07T20:32:01.9361703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9361835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9361909Z ) 2025-05-07T20:32:01.9361986Z else: 2025-05-07T20:32:01.9362076Z scale_ub_tensor = None 2025-05-07T20:32:01.9362148Z 2025-05-07T20:32:01.9362278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9362367Z op = silu_mul_quant 2025-05-07T20:32:01.9362447Z if compiled: 2025-05-07T20:32:01.9362548Z op = torch.compile(op) 2025-05-07T20:32:01.9362655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9362732Z 2025-05-07T20:32:01.9362821Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9362825Z 2025-05-07T20:32:01.9362975Z moe/activation_test.py:117: 2025-05-07T20:32:01.9363110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9363303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9363401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9363903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9364001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9364364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9364583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9364923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9365016Z kernel = self.compile( 2025-05-07T20:32:01.9365403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9365625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9365754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9365761Z 2025-05-07T20:32:01.9365962Z self = 2025-05-07T20:32:01.9366731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9367228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000ba480>} 2025-05-07T20:32:01.9367987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9368179Z context = 2025-05-07T20:32:01.9368184Z 2025-05-07T20:32:01.9368345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9368614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9368716Z module_map=module_map) 2025-05-07T20:32:01.9368879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9368973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9369045Z E ^ 2025-05-07T20:32:01.9369399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9369404Z 2025-05-07T20:32:01.9369918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9369928Z 2025-05-07T20:32:01.9370029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9370254Z self=, 2025-05-07T20:32:01.9370328Z T=1, 2025-05-07T20:32:01.9370406Z D=7168, 2025-05-07T20:32:01.9370486Z scale_ub=None, 2025-05-07T20:32:01.9370571Z contiguous=True, 2025-05-07T20:32:01.9370659Z compiled=False, 2025-05-07T20:32:01.9370730Z ) 2025-05-07T20:32:01.9370944Z self = 2025-05-07T20:32:01.9371112Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.9371116Z 2025-05-07T20:32:01.9371192Z @given( 2025-05-07T20:32:01.9371309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9371409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9371520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9371695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9371809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9371881Z ) 2025-05-07T20:32:01.9372131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9372220Z def test_silu_mul_quant( 2025-05-07T20:32:01.9372290Z self, 2025-05-07T20:32:01.9372367Z T: int, 2025-05-07T20:32:01.9372437Z D: int, 2025-05-07T20:32:01.9372534Z scale_ub: Optional[float], 2025-05-07T20:32:01.9372622Z contiguous: bool, 2025-05-07T20:32:01.9372705Z compiled: bool, 2025-05-07T20:32:01.9372778Z ) -> None: 2025-05-07T20:32:01.9372871Z torch.manual_seed(2025) 2025-05-07T20:32:01.9372944Z 2025-05-07T20:32:01.9373118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9373191Z 2025-05-07T20:32:01.9373279Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9373412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9373544Z x = x_sign * x_clamp 2025-05-07T20:32:01.9373623Z x0 = x[:, :D] 2025-05-07T20:32:01.9373707Z x1 = x[:, D:] 2025-05-07T20:32:01.9373774Z 2025-05-07T20:32:01.9373855Z if contiguous: 2025-05-07T20:32:01.9373946Z x0 = x0.contiguous() 2025-05-07T20:32:01.9374033Z x1 = x1.contiguous() 2025-05-07T20:32:01.9374106Z 2025-05-07T20:32:01.9374198Z if scale_ub is not None: 2025-05-07T20:32:01.9374300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9374437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9374506Z ) 2025-05-07T20:32:01.9374579Z else: 2025-05-07T20:32:01.9374675Z scale_ub_tensor = None 2025-05-07T20:32:01.9374743Z 2025-05-07T20:32:01.9374869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9374965Z op = silu_mul_quant 2025-05-07T20:32:01.9375054Z if compiled: 2025-05-07T20:32:01.9375149Z op = torch.compile(op) 2025-05-07T20:32:01.9375254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9375323Z 2025-05-07T20:32:01.9375411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9375415Z 2025-05-07T20:32:01.9375518Z moe/activation_test.py:117: 2025-05-07T20:32:01.9375641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9375743Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9375839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9376341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9376441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9376800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9377107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9377460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9377553Z kernel = self.compile( 2025-05-07T20:32:01.9377940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9378116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9378242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9378246Z 2025-05-07T20:32:01.9378459Z self = 2025-05-07T20:32:01.9379235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9379774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd3000b9da0>} 2025-05-07T20:32:01.9380532Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9380722Z context = 2025-05-07T20:32:01.9380726Z 2025-05-07T20:32:01.9380890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9381149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9381252Z module_map=module_map) 2025-05-07T20:32:01.9381415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9381517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9381638Z E ^ 2025-05-07T20:32:01.9381990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9381994Z 2025-05-07T20:32:01.9382402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9382407Z 2025-05-07T20:32:01.9382513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9382733Z self=, 2025-05-07T20:32:01.9382805Z T=16384, 2025-05-07T20:32:01.9382884Z D=7168, 2025-05-07T20:32:01.9382962Z scale_ub=1200.0, 2025-05-07T20:32:01.9383048Z contiguous=False, 2025-05-07T20:32:01.9383126Z compiled=True, 2025-05-07T20:32:01.9383194Z ) 2025-05-07T20:32:01.9383413Z self = 2025-05-07T20:32:01.9383591Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.9383601Z 2025-05-07T20:32:01.9383680Z @given( 2025-05-07T20:32:01.9383825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9383931Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9384056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9384177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9384287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9384362Z ) 2025-05-07T20:32:01.9384604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9384694Z def test_silu_mul_quant( 2025-05-07T20:32:01.9384770Z self, 2025-05-07T20:32:01.9384844Z T: int, 2025-05-07T20:32:01.9384920Z D: int, 2025-05-07T20:32:01.9385016Z scale_ub: Optional[float], 2025-05-07T20:32:01.9385100Z contiguous: bool, 2025-05-07T20:32:01.9385185Z compiled: bool, 2025-05-07T20:32:01.9385346Z ) -> None: 2025-05-07T20:32:01.9385444Z torch.manual_seed(2025) 2025-05-07T20:32:01.9385517Z 2025-05-07T20:32:01.9385690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9385761Z 2025-05-07T20:32:01.9385852Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9385972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9386056Z x = x_sign * x_clamp 2025-05-07T20:32:01.9386137Z x0 = x[:, :D] 2025-05-07T20:32:01.9386212Z x1 = x[:, D:] 2025-05-07T20:32:01.9386283Z 2025-05-07T20:32:01.9386368Z if contiguous: 2025-05-07T20:32:01.9386455Z x0 = x0.contiguous() 2025-05-07T20:32:01.9386540Z x1 = x1.contiguous() 2025-05-07T20:32:01.9386615Z 2025-05-07T20:32:01.9386703Z if scale_ub is not None: 2025-05-07T20:32:01.9386803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9386943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9387059Z ) 2025-05-07T20:32:01.9387132Z else: 2025-05-07T20:32:01.9387228Z scale_ub_tensor = None 2025-05-07T20:32:01.9387298Z 2025-05-07T20:32:01.9387430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9387517Z op = silu_mul_quant 2025-05-07T20:32:01.9387598Z if compiled: 2025-05-07T20:32:01.9387698Z op = torch.compile(op) 2025-05-07T20:32:01.9387801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9387869Z 2025-05-07T20:32:01.9387961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9387965Z 2025-05-07T20:32:01.9388058Z moe/activation_test.py:117: 2025-05-07T20:32:01.9388182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9388279Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9388374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9388753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9388895Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9389391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9389490Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9389847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9390067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9390413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9390503Z kernel = self.compile( 2025-05-07T20:32:01.9390887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9391059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9391188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9391193Z 2025-05-07T20:32:01.9391405Z self = 2025-05-07T20:32:01.9392169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9392672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1ca40>} 2025-05-07T20:32:01.9393417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9393684Z context = 2025-05-07T20:32:01.9393692Z 2025-05-07T20:32:01.9393854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9394116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9394222Z module_map=module_map) 2025-05-07T20:32:01.9394378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9394470Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9394548Z E ^ 2025-05-07T20:32:01.9394900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9394905Z 2025-05-07T20:32:01.9395318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9395323Z 2025-05-07T20:32:01.9395427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9395691Z self=, 2025-05-07T20:32:01.9395766Z T=1, 2025-05-07T20:32:01.9395839Z D=7168, 2025-05-07T20:32:01.9395919Z scale_ub=None, 2025-05-07T20:32:01.9396007Z contiguous=False, 2025-05-07T20:32:01.9396087Z compiled=False, 2025-05-07T20:32:01.9396158Z ) 2025-05-07T20:32:01.9396373Z self = 2025-05-07T20:32:01.9396534Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9396538Z 2025-05-07T20:32:01.9396618Z @given( 2025-05-07T20:32:01.9396733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9396827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9396942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9397053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9397168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9397294Z ) 2025-05-07T20:32:01.9397535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9397629Z def test_silu_mul_quant( 2025-05-07T20:32:01.9397702Z self, 2025-05-07T20:32:01.9397775Z T: int, 2025-05-07T20:32:01.9397853Z D: int, 2025-05-07T20:32:01.9397948Z scale_ub: Optional[float], 2025-05-07T20:32:01.9398035Z contiguous: bool, 2025-05-07T20:32:01.9398117Z compiled: bool, 2025-05-07T20:32:01.9398191Z ) -> None: 2025-05-07T20:32:01.9398282Z torch.manual_seed(2025) 2025-05-07T20:32:01.9398359Z 2025-05-07T20:32:01.9398525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9398598Z 2025-05-07T20:32:01.9398691Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9398811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9398900Z x = x_sign * x_clamp 2025-05-07T20:32:01.9398983Z x0 = x[:, :D] 2025-05-07T20:32:01.9399063Z x1 = x[:, D:] 2025-05-07T20:32:01.9399138Z 2025-05-07T20:32:01.9399216Z if contiguous: 2025-05-07T20:32:01.9399303Z x0 = x0.contiguous() 2025-05-07T20:32:01.9399394Z x1 = x1.contiguous() 2025-05-07T20:32:01.9399463Z 2025-05-07T20:32:01.9399548Z if scale_ub is not None: 2025-05-07T20:32:01.9399654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9399785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9399857Z ) 2025-05-07T20:32:01.9399929Z else: 2025-05-07T20:32:01.9400017Z scale_ub_tensor = None 2025-05-07T20:32:01.9400087Z 2025-05-07T20:32:01.9400217Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9400305Z op = silu_mul_quant 2025-05-07T20:32:01.9400392Z if compiled: 2025-05-07T20:32:01.9400489Z op = torch.compile(op) 2025-05-07T20:32:01.9400698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9400775Z 2025-05-07T20:32:01.9400862Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9400866Z 2025-05-07T20:32:01.9400959Z moe/activation_test.py:117: 2025-05-07T20:32:01.9401087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9401185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9401280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9401778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9401872Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9402231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9402451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9402798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9402937Z kernel = self.compile( 2025-05-07T20:32:01.9403411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9403586Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9403713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9403718Z 2025-05-07T20:32:01.9403947Z self = 2025-05-07T20:32:01.9404744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9405249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1d8a0>} 2025-05-07T20:32:01.9406046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9406235Z context = 2025-05-07T20:32:01.9406239Z 2025-05-07T20:32:01.9406401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9406665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9406768Z module_map=module_map) 2025-05-07T20:32:01.9406932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9407028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9407100Z E ^ 2025-05-07T20:32:01.9407459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9407469Z 2025-05-07T20:32:01.9407881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9407885Z 2025-05-07T20:32:01.9407987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9408206Z self=, 2025-05-07T20:32:01.9408279Z T=2048, 2025-05-07T20:32:01.9408354Z D=7168, 2025-05-07T20:32:01.9408435Z scale_ub=None, 2025-05-07T20:32:01.9408516Z contiguous=False, 2025-05-07T20:32:01.9408599Z compiled=True, 2025-05-07T20:32:01.9408670Z ) 2025-05-07T20:32:01.9408883Z self = 2025-05-07T20:32:01.9409057Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9409062Z 2025-05-07T20:32:01.9409136Z @given( 2025-05-07T20:32:01.9409339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9409437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9409547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9409660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9409768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9409837Z ) 2025-05-07T20:32:01.9410083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9410172Z def test_silu_mul_quant( 2025-05-07T20:32:01.9410245Z self, 2025-05-07T20:32:01.9410322Z T: int, 2025-05-07T20:32:01.9410393Z D: int, 2025-05-07T20:32:01.9410486Z scale_ub: Optional[float], 2025-05-07T20:32:01.9410572Z contiguous: bool, 2025-05-07T20:32:01.9410657Z compiled: bool, 2025-05-07T20:32:01.9410736Z ) -> None: 2025-05-07T20:32:01.9410827Z torch.manual_seed(2025) 2025-05-07T20:32:01.9410897Z 2025-05-07T20:32:01.9411113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9411187Z 2025-05-07T20:32:01.9411278Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9411403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9411487Z x = x_sign * x_clamp 2025-05-07T20:32:01.9411564Z x0 = x[:, :D] 2025-05-07T20:32:01.9411644Z x1 = x[:, D:] 2025-05-07T20:32:01.9411712Z 2025-05-07T20:32:01.9411792Z if contiguous: 2025-05-07T20:32:01.9411882Z x0 = x0.contiguous() 2025-05-07T20:32:01.9411967Z x1 = x1.contiguous() 2025-05-07T20:32:01.9412036Z 2025-05-07T20:32:01.9412122Z if scale_ub is not None: 2025-05-07T20:32:01.9412224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9412357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9412435Z ) 2025-05-07T20:32:01.9412506Z else: 2025-05-07T20:32:01.9412605Z scale_ub_tensor = None 2025-05-07T20:32:01.9412728Z 2025-05-07T20:32:01.9412855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9412946Z op = silu_mul_quant 2025-05-07T20:32:01.9413028Z if compiled: 2025-05-07T20:32:01.9413121Z op = torch.compile(op) 2025-05-07T20:32:01.9413226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9413298Z 2025-05-07T20:32:01.9413384Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9413389Z 2025-05-07T20:32:01.9413482Z moe/activation_test.py:117: 2025-05-07T20:32:01.9413606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9413700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9413800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9414168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9414266Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9414773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9414866Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9415228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9415448Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9415789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9415882Z kernel = self.compile( 2025-05-07T20:32:01.9416262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9416437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9416642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9416651Z 2025-05-07T20:32:01.9416856Z self = 2025-05-07T20:32:01.9417628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9418129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1eb60>} 2025-05-07T20:32:01.9418881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9419069Z context = 2025-05-07T20:32:01.9419073Z 2025-05-07T20:32:01.9419285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9419548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9419650Z module_map=module_map) 2025-05-07T20:32:01.9419812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9419907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9419981Z E ^ 2025-05-07T20:32:01.9420334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9420339Z 2025-05-07T20:32:01.9420748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9420753Z 2025-05-07T20:32:01.9420857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9421077Z self=, 2025-05-07T20:32:01.9421156Z T=4096, 2025-05-07T20:32:01.9421280Z D=7168, 2025-05-07T20:32:01.9421359Z scale_ub=None, 2025-05-07T20:32:01.9421441Z contiguous=False, 2025-05-07T20:32:01.9421521Z compiled=True, 2025-05-07T20:32:01.9421590Z ) 2025-05-07T20:32:01.9421802Z self = 2025-05-07T20:32:01.9421974Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9421979Z 2025-05-07T20:32:01.9422053Z @given( 2025-05-07T20:32:01.9422170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9422268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9422377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9422493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9422602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9422673Z ) 2025-05-07T20:32:01.9422927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9423023Z def test_silu_mul_quant( 2025-05-07T20:32:01.9423100Z self, 2025-05-07T20:32:01.9423172Z T: int, 2025-05-07T20:32:01.9423246Z D: int, 2025-05-07T20:32:01.9423340Z scale_ub: Optional[float], 2025-05-07T20:32:01.9423425Z contiguous: bool, 2025-05-07T20:32:01.9423506Z compiled: bool, 2025-05-07T20:32:01.9423580Z ) -> None: 2025-05-07T20:32:01.9423669Z torch.manual_seed(2025) 2025-05-07T20:32:01.9423738Z 2025-05-07T20:32:01.9423907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9423976Z 2025-05-07T20:32:01.9424065Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9424191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9424276Z x = x_sign * x_clamp 2025-05-07T20:32:01.9424351Z x0 = x[:, :D] 2025-05-07T20:32:01.9424430Z x1 = x[:, D:] 2025-05-07T20:32:01.9424499Z 2025-05-07T20:32:01.9424663Z if contiguous: 2025-05-07T20:32:01.9424756Z x0 = x0.contiguous() 2025-05-07T20:32:01.9424841Z x1 = x1.contiguous() 2025-05-07T20:32:01.9424913Z 2025-05-07T20:32:01.9424998Z if scale_ub is not None: 2025-05-07T20:32:01.9425100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9425232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9425301Z ) 2025-05-07T20:32:01.9425373Z else: 2025-05-07T20:32:01.9425465Z scale_ub_tensor = None 2025-05-07T20:32:01.9425535Z 2025-05-07T20:32:01.9425660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9425748Z op = silu_mul_quant 2025-05-07T20:32:01.9425830Z if compiled: 2025-05-07T20:32:01.9425925Z op = torch.compile(op) 2025-05-07T20:32:01.9426027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9426093Z 2025-05-07T20:32:01.9426189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9426237Z 2025-05-07T20:32:01.9426330Z moe/activation_test.py:117: 2025-05-07T20:32:01.9426453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9426552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9426646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9427012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9427102Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9427595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9427689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9428044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9428268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9428676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9428766Z kernel = self.compile( 2025-05-07T20:32:01.9429150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9429323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9429445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9429450Z 2025-05-07T20:32:01.9429657Z self = 2025-05-07T20:32:01.9430422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9430928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfe1fe20>} 2025-05-07T20:32:01.9431684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9431871Z context = 2025-05-07T20:32:01.9431876Z 2025-05-07T20:32:01.9432038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9432297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9432403Z module_map=module_map) 2025-05-07T20:32:01.9432564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9432655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9432731Z E ^ 2025-05-07T20:32:01.9433164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9433171Z 2025-05-07T20:32:01.9433582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9433590Z 2025-05-07T20:32:01.9433686Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9433910Z self=, 2025-05-07T20:32:01.9433987Z T=16384, 2025-05-07T20:32:01.9434058Z D=5120, 2025-05-07T20:32:01.9434136Z scale_ub=1200.0, 2025-05-07T20:32:01.9434224Z contiguous=False, 2025-05-07T20:32:01.9434303Z compiled=False, 2025-05-07T20:32:01.9434373Z ) 2025-05-07T20:32:01.9434588Z self = 2025-05-07T20:32:01.9434764Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9434809Z 2025-05-07T20:32:01.9434897Z @given( 2025-05-07T20:32:01.9435015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9435108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9435219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9435333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9435441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9435513Z ) 2025-05-07T20:32:01.9435756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9435845Z def test_silu_mul_quant( 2025-05-07T20:32:01.9435922Z self, 2025-05-07T20:32:01.9435995Z T: int, 2025-05-07T20:32:01.9436069Z D: int, 2025-05-07T20:32:01.9436168Z scale_ub: Optional[float], 2025-05-07T20:32:01.9436254Z contiguous: bool, 2025-05-07T20:32:01.9436338Z compiled: bool, 2025-05-07T20:32:01.9436411Z ) -> None: 2025-05-07T20:32:01.9436515Z torch.manual_seed(2025) 2025-05-07T20:32:01.9436629Z 2025-05-07T20:32:01.9436793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9436862Z 2025-05-07T20:32:01.9436952Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9437072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9437154Z x = x_sign * x_clamp 2025-05-07T20:32:01.9437238Z x0 = x[:, :D] 2025-05-07T20:32:01.9437319Z x1 = x[:, D:] 2025-05-07T20:32:01.9437387Z 2025-05-07T20:32:01.9437468Z if contiguous: 2025-05-07T20:32:01.9437560Z x0 = x0.contiguous() 2025-05-07T20:32:01.9437647Z x1 = x1.contiguous() 2025-05-07T20:32:01.9437715Z 2025-05-07T20:32:01.9437801Z if scale_ub is not None: 2025-05-07T20:32:01.9437907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9438038Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9438112Z ) 2025-05-07T20:32:01.9438197Z else: 2025-05-07T20:32:01.9438288Z scale_ub_tensor = None 2025-05-07T20:32:01.9438356Z 2025-05-07T20:32:01.9438712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9438833Z op = silu_mul_quant 2025-05-07T20:32:01.9438915Z if compiled: 2025-05-07T20:32:01.9439015Z op = torch.compile(op) 2025-05-07T20:32:01.9439118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9439188Z 2025-05-07T20:32:01.9439274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9439278Z 2025-05-07T20:32:01.9439372Z moe/activation_test.py:117: 2025-05-07T20:32:01.9439501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9439597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9439691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9440334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.9440435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9440795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9441014Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9441354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9441443Z kernel = self.compile( 2025-05-07T20:32:01.9441823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9441994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9442119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9442123Z 2025-05-07T20:32:01.9442334Z self = 2025-05-07T20:32:01.9443162Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9443727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb38d60>} 2025-05-07T20:32:01.9444482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9444693Z context = 2025-05-07T20:32:01.9444698Z 2025-05-07T20:32:01.9444883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9445152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9445323Z module_map=module_map) 2025-05-07T20:32:01.9445482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9445580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9445654Z E ^ 2025-05-07T20:32:01.9446011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9446016Z 2025-05-07T20:32:01.9446428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9446432Z 2025-05-07T20:32:01.9446533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9446758Z self=, 2025-05-07T20:32:01.9446834Z T=16384, 2025-05-07T20:32:01.9446907Z D=5120, 2025-05-07T20:32:01.9446986Z scale_ub=1200.0, 2025-05-07T20:32:01.9447071Z contiguous=True, 2025-05-07T20:32:01.9447159Z compiled=True, 2025-05-07T20:32:01.9447228Z ) 2025-05-07T20:32:01.9447440Z self = 2025-05-07T20:32:01.9447612Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9447616Z 2025-05-07T20:32:01.9447695Z @given( 2025-05-07T20:32:01.9447810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9447907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9448016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9448131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9448241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9448307Z ) 2025-05-07T20:32:01.9448550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9448638Z def test_silu_mul_quant( 2025-05-07T20:32:01.9448793Z self, 2025-05-07T20:32:01.9448874Z T: int, 2025-05-07T20:32:01.9448946Z D: int, 2025-05-07T20:32:01.9449039Z scale_ub: Optional[float], 2025-05-07T20:32:01.9449132Z contiguous: bool, 2025-05-07T20:32:01.9449213Z compiled: bool, 2025-05-07T20:32:01.9449288Z ) -> None: 2025-05-07T20:32:01.9449385Z torch.manual_seed(2025) 2025-05-07T20:32:01.9449455Z 2025-05-07T20:32:01.9449621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9449690Z 2025-05-07T20:32:01.9449780Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9449905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9449989Z x = x_sign * x_clamp 2025-05-07T20:32:01.9450066Z x0 = x[:, :D] 2025-05-07T20:32:01.9450144Z x1 = x[:, D:] 2025-05-07T20:32:01.9450212Z 2025-05-07T20:32:01.9450292Z if contiguous: 2025-05-07T20:32:01.9450382Z x0 = x0.contiguous() 2025-05-07T20:32:01.9450515Z x1 = x1.contiguous() 2025-05-07T20:32:01.9450587Z 2025-05-07T20:32:01.9450677Z if scale_ub is not None: 2025-05-07T20:32:01.9450777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9450909Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9450983Z ) 2025-05-07T20:32:01.9451056Z else: 2025-05-07T20:32:01.9451149Z scale_ub_tensor = None 2025-05-07T20:32:01.9451217Z 2025-05-07T20:32:01.9451341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9451430Z op = silu_mul_quant 2025-05-07T20:32:01.9451511Z if compiled: 2025-05-07T20:32:01.9451605Z op = torch.compile(op) 2025-05-07T20:32:01.9451710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9451778Z 2025-05-07T20:32:01.9451867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9451871Z 2025-05-07T20:32:01.9451967Z moe/activation_test.py:117: 2025-05-07T20:32:01.9452098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9452241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9452334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9452699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9452790Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9453279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9453371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9453727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9453945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9454287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9454382Z kernel = self.compile( 2025-05-07T20:32:01.9454761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9454935Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9455057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9455061Z 2025-05-07T20:32:01.9455267Z self = 2025-05-07T20:32:01.9456038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9456612Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bfb3a200>} 2025-05-07T20:32:01.9457369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9457555Z context = 2025-05-07T20:32:01.9457559Z 2025-05-07T20:32:01.9457724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9457983Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9458085Z module_map=module_map) 2025-05-07T20:32:01.9458244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9458337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9458411Z E ^ 2025-05-07T20:32:01.9458771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9458842Z 2025-05-07T20:32:01.9459258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9459806Z 2025-05-07T20:32:01.9459904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9460314Z self=, 2025-05-07T20:32:01.9460703Z T=16384, 2025-05-07T20:32:01.9460889Z D=5120, 2025-05-07T20:32:01.9461070Z scale_ub=None, 2025-05-07T20:32:01.9461294Z contiguous=False, 2025-05-07T20:32:01.9461583Z compiled=True, 2025-05-07T20:32:01.9461815Z ) 2025-05-07T20:32:01.9462204Z self = 2025-05-07T20:32:01.9462721Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.9463058Z 2025-05-07T20:32:01.9463135Z @given( 2025-05-07T20:32:01.9463409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9463828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9464168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9464535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9464931Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9465246Z ) 2025-05-07T20:32:01.9465643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9466161Z def test_silu_mul_quant( 2025-05-07T20:32:01.9466416Z self, 2025-05-07T20:32:01.9466616Z T: int, 2025-05-07T20:32:01.9466818Z D: int, 2025-05-07T20:32:01.9467038Z scale_ub: Optional[float], 2025-05-07T20:32:01.9467331Z contiguous: bool, 2025-05-07T20:32:01.9467584Z compiled: bool, 2025-05-07T20:32:01.9467815Z ) -> None: 2025-05-07T20:32:01.9468033Z torch.manual_seed(2025) 2025-05-07T20:32:01.9468288Z 2025-05-07T20:32:01.9468584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9468967Z 2025-05-07T20:32:01.9469164Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9474894Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9475223Z x = x_sign * x_clamp 2025-05-07T20:32:01.9475462Z x0 = x[:, :D] 2025-05-07T20:32:01.9475674Z x1 = x[:, D:] 2025-05-07T20:32:01.9475876Z 2025-05-07T20:32:01.9476064Z if contiguous: 2025-05-07T20:32:01.9476292Z x0 = x0.contiguous() 2025-05-07T20:32:01.9476549Z x1 = x1.contiguous() 2025-05-07T20:32:01.9476787Z 2025-05-07T20:32:01.9476971Z if scale_ub is not None: 2025-05-07T20:32:01.9477236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9477568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9477880Z ) 2025-05-07T20:32:01.9478069Z else: 2025-05-07T20:32:01.9478277Z scale_ub_tensor = None 2025-05-07T20:32:01.9478531Z 2025-05-07T20:32:01.9478879Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9479199Z op = silu_mul_quant 2025-05-07T20:32:01.9479444Z if compiled: 2025-05-07T20:32:01.9479691Z op = torch.compile(op) 2025-05-07T20:32:01.9479983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9480249Z 2025-05-07T20:32:01.9480431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9480597Z 2025-05-07T20:32:01.9480699Z moe/activation_test.py:117: 2025-05-07T20:32:01.9480990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9481317Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9481589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9482289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9482850Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9483897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:01.9484643Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:01.9485183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:01.9485862Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:01.9486524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:01.9487058Z     kernel = self.compile(
2025-05-07T20:32:01.9487600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:01.9488247Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:01.9488641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:01.9488870Z 
2025-05-07T20:32:01.9489082Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:01.9490201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:01.9491579Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd1bfb3ad40>}
2025-05-07T20:32:01.9492917Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:01.9493939Z context = <...>
2025-05-07T20:32:01.9494222Z 
2025-05-07T20:32:01.9494391Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:01.9494911Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:01.9495367Z                            module_map=module_map)
2025-05-07T20:32:01.9495730Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.9496075Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:01.9496324Z E       ^
2025-05-07T20:32:01.9496789Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.9497238Z 
2025-05-07T20:32:01.9497655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9498166Z 
2025-05-07T20:32:01.9498271Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:01.9529175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9529795Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9561038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9561654Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:01.9592677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9593294Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9631584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9632242Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:01.9663793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9664411Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:01.9678158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9678263Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:01.9690804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9690958Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9703837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9703960Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:01.9716960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.9717066Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:01.9730094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
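Editor's note: every CompilationError above is the same architecture mismatch, not a flaky failure. Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this runner's GPU and reports that only 'fp8e4b15' and 'fp8e5' are available, which is the behavior of pre-SM-8.9 devices in recent Triton releases. A minimal sketch of a capability guard that would skip these cases instead of failing in the compiler follows; the helper name, the test class name, and the (8, 9) threshold are assumptions for illustration, not FBGEMM's actual gating mechanism.

    # Sketch only: skip fp8e4nv tests on GPUs that cannot compile them.
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # fp8e4nv kernels are assumed here to require compute capability
        # >= (8, 9); older parts expose only fp8e4b15/fp8e5, matching the
        # ValueError in the log above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class ActivationTests(unittest.TestCase):  # hypothetical class name
        @unittest.skipUnless(
            _supports_fp8e4nv(), "requires an fp8e4nv-capable GPU (SM >= 8.9)"
        )
        def test_silu_mul_quant(self) -> None:
            ...  # the existing @given-decorated body would go here unchanged

With such a guard, the dozen examples above would be reported as skips with an explicit reason rather than as CompilationError failures.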
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9729679Z 2025-05-07T20:32:01.9730094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9730098Z 2025-05-07T20:32:01.9730199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9730421Z self=, 2025-05-07T20:32:01.9730498Z T=16384, 2025-05-07T20:32:01.9730619Z D=5120, 2025-05-07T20:32:01.9730702Z scale_ub=None, 2025-05-07T20:32:01.9730831Z contiguous=False, 2025-05-07T20:32:01.9730913Z compiled=False, 2025-05-07T20:32:01.9730990Z ) 2025-05-07T20:32:01.9731204Z self = 2025-05-07T20:32:01.9731384Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9731388Z 2025-05-07T20:32:01.9731465Z @given( 2025-05-07T20:32:01.9731582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9731683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9731793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9731907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9732022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9732093Z ) 2025-05-07T20:32:01.9732340Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9732432Z def test_silu_mul_quant( 2025-05-07T20:32:01.9732515Z self, 2025-05-07T20:32:01.9732633Z T: int, 2025-05-07T20:32:01.9732710Z D: int, 2025-05-07T20:32:01.9732809Z scale_ub: Optional[float], 2025-05-07T20:32:01.9732902Z contiguous: bool, 2025-05-07T20:32:01.9732987Z compiled: bool, 2025-05-07T20:32:01.9733066Z ) -> None: 2025-05-07T20:32:01.9733165Z torch.manual_seed(2025) 2025-05-07T20:32:01.9733242Z 2025-05-07T20:32:01.9733409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9733483Z 2025-05-07T20:32:01.9733573Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9733694Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9735522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
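This CompilationError is independent of the Hypothesis parameters: fp8e4nv is Triton's float8 e4m3 type, and the supported list the error prints, ('fp8e4b15', 'fp8e5'), appears to be what Triton reports on GPUs whose compute capability is below 8.9, so the kernel can never compile on this runner's GPU. A minimal guard sketch, assuming the sm_89 cutoff; the helper name and skip wiring are illustrative, not FBGEMM's actual test scaffolding:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton only exposes fp8e4nv (float8 e4m3) on GPUs with
    # compute capability >= 8.9 (Ada/Hopper); older parts report only
    # fp8e4b15/fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: a skip on the failing test class.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    ...

With a guard like this, the examples below would be skipped on this runner instead of failing one by one.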
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
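The OOM side of the log is cumulative rather than per-example: this first OOM cannot get 320 MiB while PyTorch already holds 21.60 GiB, and the free figure keeps shrinking in the examples that follow, which suggests tensors from earlier failed examples are still referenced (Hypothesis keeps tracebacks of failures alive, and torch.compile keeps its own caches). A best-effort cleanup sketch under that assumption; where to call it, for instance at the top of the test body so each example starts from a drained allocator, is a suggestion, not FBGEMM's actual fixture:

import gc

import torch


def release_cuda_memory() -> None:
    # Drop unreferenced tensors first, then return the caching allocator's
    # unused blocks to the driver; synchronize so pending frees have landed.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Note that empty_cache() only releases blocks no live tensor occupies, so it helps with fragmentation and inter-process pressure but cannot reclaim memory that is still referenced.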
All remaining examples failed in one of the same two ways: torch.OutOfMemoryError while building the bfloat16 inputs (moe/activation_test.py:92 torch.randn, :94 torch.sign, or :95 torch.clamp), or the fp8e4nv CompilationError above from triton/compiler/compiler.py:100 once execution reached the kernel. Free GPU memory shrank from 140.44 MiB to 26.44 MiB over the run. In order:

Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False -> OutOfMemoryError at :92 (448.00 MiB)
Trying example: T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True -> OutOfMemoryError at :95 (56.00 MiB)
Trying example: T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :94 (56.00 MiB)
Trying example: T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=128, D=5120, scale_ub=None, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=128, D=7168, scale_ub=None, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OutOfMemoryError at :92 (56.00 MiB)
Trying example: T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False -> CompilationError (fp8e4nv)
Trying example: T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :94 (40.00 MiB)
Trying example: T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (320.00 MiB)
Trying example: T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (80.00 MiB)
Trying example: T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False -> OutOfMemoryError at :92 (40.00 MiB)
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> OutOfMemoryError at :92 (40.00 MiB)
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True -> OutOfMemoryError at :92 (448.00 MiB)
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (112.00 MiB)
Trying example: T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False -> OutOfMemoryError at :92 (448.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9869183Z 2025-05-07T20:32:01.9869301Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9869305Z 2025-05-07T20:32:01.9869406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9869626Z self=, 2025-05-07T20:32:01.9869706Z T=16384, 2025-05-07T20:32:01.9869782Z D=7168, 2025-05-07T20:32:01.9869865Z scale_ub=1200.0, 2025-05-07T20:32:01.9869948Z contiguous=True, 2025-05-07T20:32:01.9870030Z compiled=False, 2025-05-07T20:32:01.9870103Z ) 2025-05-07T20:32:01.9870315Z self = 2025-05-07T20:32:01.9870489Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9870541Z 2025-05-07T20:32:01.9870618Z @given( 2025-05-07T20:32:01.9870730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9870824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9870937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9871048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9871160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9871233Z ) 2025-05-07T20:32:01.9871472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9871565Z def test_silu_mul_quant( 2025-05-07T20:32:01.9871640Z self, 2025-05-07T20:32:01.9871715Z T: int, 2025-05-07T20:32:01.9871794Z D: int, 2025-05-07T20:32:01.9871888Z scale_ub: Optional[float], 2025-05-07T20:32:01.9871974Z contiguous: bool, 2025-05-07T20:32:01.9872101Z compiled: bool, 2025-05-07T20:32:01.9872176Z ) -> None: 2025-05-07T20:32:01.9872321Z torch.manual_seed(2025) 2025-05-07T20:32:01.9872390Z 2025-05-07T20:32:01.9872555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9874334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
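The allocator hint repeated in these OOM messages can be acted on directly. A minimal sketch, assuming the variable has to be visible before torch initializes its CUDA caching allocator (exporting it in the CI job environment before pytest launches achieves the same thing):

    # Sketch: apply the allocator advice from the OOM messages above.
    # PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
    # initialized, so it must be set before the first CUDA allocation.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the setting is already in place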
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9874340Z 2025-05-07T20:32:01.9874453Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9874458Z 2025-05-07T20:32:01.9874563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9874788Z self=, 2025-05-07T20:32:01.9874904Z T=128, 2025-05-07T20:32:01.9874980Z D=5120, 2025-05-07T20:32:01.9875058Z scale_ub=1200.0, 2025-05-07T20:32:01.9875140Z contiguous=False, 2025-05-07T20:32:01.9875224Z compiled=False, 2025-05-07T20:32:01.9875293Z ) 2025-05-07T20:32:01.9875505Z self = 2025-05-07T20:32:01.9875677Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.9875682Z 2025-05-07T20:32:01.9875758Z @given( 2025-05-07T20:32:01.9875875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9875968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9876078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9876192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9876302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9876380Z ) 2025-05-07T20:32:01.9876623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9876715Z def test_silu_mul_quant( 2025-05-07T20:32:01.9876789Z self, 2025-05-07T20:32:01.9876869Z T: int, 2025-05-07T20:32:01.9876945Z D: int, 2025-05-07T20:32:01.9877042Z scale_ub: Optional[float], 2025-05-07T20:32:01.9877130Z contiguous: bool, 2025-05-07T20:32:01.9877213Z compiled: bool, 2025-05-07T20:32:01.9877291Z ) -> None: 2025-05-07T20:32:01.9877382Z torch.manual_seed(2025) 2025-05-07T20:32:01.9877455Z 2025-05-07T20:32:01.9877624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9877693Z 2025-05-07T20:32:01.9877782Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9877907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9877994Z x = x_sign * x_clamp 2025-05-07T20:32:01.9878117Z x0 = x[:, :D] 2025-05-07T20:32:01.9878204Z x1 = x[:, D:] 2025-05-07T20:32:01.9878273Z 2025-05-07T20:32:01.9878354Z if contiguous: 2025-05-07T20:32:01.9878446Z x0 = x0.contiguous() 2025-05-07T20:32:01.9878532Z x1 = x1.contiguous() 2025-05-07T20:32:01.9878608Z 2025-05-07T20:32:01.9878696Z if scale_ub is not None: 2025-05-07T20:32:01.9878799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9878935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9879008Z ) 2025-05-07T20:32:01.9879082Z else: 2025-05-07T20:32:01.9879180Z scale_ub_tensor = None 2025-05-07T20:32:01.9879250Z 2025-05-07T20:32:01.9879378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9884188Z op = silu_mul_quant 2025-05-07T20:32:01.9884301Z if compiled: 2025-05-07T20:32:01.9884470Z op = torch.compile(op) 2025-05-07T20:32:01.9884582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9884700Z 2025-05-07T20:32:01.9884793Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9884798Z 2025-05-07T20:32:01.9884896Z moe/activation_test.py:117: 2025-05-07T20:32:01.9885029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9885130Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9885235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9885736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9885834Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9886197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9886419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9886763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9886913Z kernel = self.compile( 2025-05-07T20:32:01.9887298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9887475Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9887601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9887606Z 2025-05-07T20:32:01.9887811Z self = 2025-05-07T20:32:01.9888589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9889095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf52b7e0>} 2025-05-07T20:32:01.9889853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9890044Z context = 2025-05-07T20:32:01.9890049Z 2025-05-07T20:32:01.9890216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9890490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9890597Z module_map=module_map) 2025-05-07T20:32:01.9890759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9890858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9890935Z E ^ 2025-05-07T20:32:01.9891340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9891351Z 2025-05-07T20:32:01.9891764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9891769Z 2025-05-07T20:32:01.9891872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9892090Z self=, 2025-05-07T20:32:01.9892170Z T=2048, 2025-05-07T20:32:01.9892254Z D=7168, 2025-05-07T20:32:01.9892336Z scale_ub=None, 2025-05-07T20:32:01.9892423Z contiguous=False, 2025-05-07T20:32:01.9892514Z compiled=False, 2025-05-07T20:32:01.9892586Z ) 2025-05-07T20:32:01.9892801Z self = 2025-05-07T20:32:01.9892979Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9892984Z 2025-05-07T20:32:01.9893101Z @given( 2025-05-07T20:32:01.9893223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9893361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9893474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9893596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9893710Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9893793Z ) 2025-05-07T20:32:01.9894080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9894175Z def test_silu_mul_quant( 2025-05-07T20:32:01.9894248Z self, 2025-05-07T20:32:01.9894327Z T: int, 2025-05-07T20:32:01.9894405Z D: int, 2025-05-07T20:32:01.9894506Z scale_ub: Optional[float], 2025-05-07T20:32:01.9894596Z contiguous: bool, 2025-05-07T20:32:01.9894681Z compiled: bool, 2025-05-07T20:32:01.9894766Z ) -> None: 2025-05-07T20:32:01.9894863Z torch.manual_seed(2025) 2025-05-07T20:32:01.9894937Z 2025-05-07T20:32:01.9895113Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9896937Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
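The CompilationError above ("type fp8e4nv not supported in this architecture") is Triton rejecting the e4m3 fp8 type on this runner's GPU. A minimal sketch of a capability guard, under the assumption that fp8e4nv needs compute capability 8.9 or newer (the g5 runner's A10G reports 8.6, which matches the error); the helper and decorator names are illustrative, not from the test file:

    # Sketch: skip fp8 kernels on GPUs that predate fp8e4nv (e4m3) support.
    # Assumption: Triton's fp8e4nv needs SM 8.9+ (Ada/Hopper); an A10G is SM 8.6.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(_supports_fp8e4nv(), "GPU lacks fp8e4nv support")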
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9896943Z 2025-05-07T20:32:01.9897065Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9897069Z 2025-05-07T20:32:01.9897172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9897391Z self=, 2025-05-07T20:32:01.9897478Z T=128, 2025-05-07T20:32:01.9897561Z D=7168, 2025-05-07T20:32:01.9897644Z scale_ub=1200.0, 2025-05-07T20:32:01.9897728Z contiguous=True, 2025-05-07T20:32:01.9897808Z compiled=True, 2025-05-07T20:32:01.9897885Z ) 2025-05-07T20:32:01.9898098Z self = 2025-05-07T20:32:01.9898262Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9898267Z 2025-05-07T20:32:01.9898344Z @given( 2025-05-07T20:32:01.9898460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9898556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9898670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9898785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9898901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9898975Z ) 2025-05-07T20:32:01.9899260Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9899362Z def test_silu_mul_quant( 2025-05-07T20:32:01.9899441Z self, 2025-05-07T20:32:01.9899515Z T: int, 2025-05-07T20:32:01.9899593Z D: int, 2025-05-07T20:32:01.9899685Z scale_ub: Optional[float], 2025-05-07T20:32:01.9899770Z contiguous: bool, 2025-05-07T20:32:01.9899859Z compiled: bool, 2025-05-07T20:32:01.9899936Z ) -> None: 2025-05-07T20:32:01.9900028Z torch.manual_seed(2025) 2025-05-07T20:32:01.9900103Z 2025-05-07T20:32:01.9900268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9900347Z 2025-05-07T20:32:01.9900435Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9900557Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9900648Z x = x_sign * x_clamp 2025-05-07T20:32:01.9900729Z x0 = x[:, :D] 2025-05-07T20:32:01.9900875Z x1 = x[:, D:] 2025-05-07T20:32:01.9900954Z 2025-05-07T20:32:01.9901076Z if contiguous: 2025-05-07T20:32:01.9901170Z x0 = x0.contiguous() 2025-05-07T20:32:01.9901268Z x1 = x1.contiguous() 2025-05-07T20:32:01.9901338Z 2025-05-07T20:32:01.9901428Z if scale_ub is not None: 2025-05-07T20:32:01.9901537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9901668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9901742Z ) 2025-05-07T20:32:01.9901821Z else: 2025-05-07T20:32:01.9901914Z scale_ub_tensor = None 2025-05-07T20:32:01.9901991Z 2025-05-07T20:32:01.9902118Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9902206Z op = silu_mul_quant 2025-05-07T20:32:01.9902292Z if compiled: 2025-05-07T20:32:01.9902390Z op = torch.compile(op) 2025-05-07T20:32:01.9902494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9902574Z 2025-05-07T20:32:01.9902663Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9902714Z 2025-05-07T20:32:01.9902814Z moe/activation_test.py:117: 2025-05-07T20:32:01.9902946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9903043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9903144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9903512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.9903606Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.9904104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9904197Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9904553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9904786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9905133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9905231Z kernel = self.compile( 2025-05-07T20:32:01.9905613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9905785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9905912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9905917Z 2025-05-07T20:32:01.9906121Z self = 2025-05-07T20:32:01.9906895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9907435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bf8d6a20>} 2025-05-07T20:32:01.9908190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9908380Z context = 2025-05-07T20:32:01.9908385Z 2025-05-07T20:32:01.9908545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9908808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9908911Z module_map=module_map) 2025-05-07T20:32:01.9909068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9909170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9909291Z E ^ 2025-05-07T20:32:01.9909684Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9909697Z 2025-05-07T20:32:01.9910111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9910116Z 2025-05-07T20:32:01.9910216Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9910440Z self=, 2025-05-07T20:32:01.9910516Z T=128, 2025-05-07T20:32:01.9910592Z D=7168, 2025-05-07T20:32:01.9910680Z scale_ub=1200.0, 2025-05-07T20:32:01.9910764Z contiguous=True, 2025-05-07T20:32:01.9910848Z compiled=False, 2025-05-07T20:32:01.9910926Z ) 2025-05-07T20:32:01.9911141Z self = 2025-05-07T20:32:01.9911311Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.9911315Z 2025-05-07T20:32:01.9911403Z @given( 2025-05-07T20:32:01.9911562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9911656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9911773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9911885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9911998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9912071Z ) 2025-05-07T20:32:01.9912309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9912404Z def test_silu_mul_quant( 2025-05-07T20:32:01.9912477Z self, 2025-05-07T20:32:01.9912553Z T: int, 2025-05-07T20:32:01.9912631Z D: int, 2025-05-07T20:32:01.9912726Z scale_ub: Optional[float], 2025-05-07T20:32:01.9912813Z contiguous: bool, 2025-05-07T20:32:01.9912898Z compiled: bool, 2025-05-07T20:32:01.9912979Z ) -> None: 2025-05-07T20:32:01.9913070Z torch.manual_seed(2025) 2025-05-07T20:32:01.9913147Z 2025-05-07T20:32:01.9913312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9913387Z 2025-05-07T20:32:01.9913475Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9913597Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9915425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9915432Z 2025-05-07T20:32:01.9915548Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.9915598Z 2025-05-07T20:32:01.9915702Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9915919Z self=, 2025-05-07T20:32:01.9915992Z T=128, 2025-05-07T20:32:01.9916075Z D=5120, 2025-05-07T20:32:01.9916154Z scale_ub=1200.0, 2025-05-07T20:32:01.9916235Z contiguous=True, 2025-05-07T20:32:01.9916317Z compiled=True, 2025-05-07T20:32:01.9916387Z ) 2025-05-07T20:32:01.9916599Z self = 2025-05-07T20:32:01.9916767Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.9916771Z 2025-05-07T20:32:01.9916845Z @given( 2025-05-07T20:32:01.9916962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9917056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9917207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9917327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9917477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9917549Z ) 2025-05-07T20:32:01.9917796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9917885Z def test_silu_mul_quant( 2025-05-07T20:32:01.9917962Z self, 2025-05-07T20:32:01.9918033Z T: int, 2025-05-07T20:32:01.9918107Z D: int, 2025-05-07T20:32:01.9918204Z scale_ub: Optional[float], 2025-05-07T20:32:01.9918289Z contiguous: bool, 2025-05-07T20:32:01.9918371Z compiled: bool, 2025-05-07T20:32:01.9918450Z ) -> None: 2025-05-07T20:32:01.9918541Z torch.manual_seed(2025) 2025-05-07T20:32:01.9918612Z 2025-05-07T20:32:01.9918781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9918851Z 2025-05-07T20:32:01.9918939Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9919067Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9920877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
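The OOM failures above show memory accumulating across Hypothesis examples: the largest case allocates a 16384 x 14336 bfloat16 input (exactly the 448.00 MiB requests seen earlier), and by these later examples only a few MiB of the 22 GiB card remain free. A minimal sketch of reclaiming memory between examples; the helper name and where to hook it (a per-example setup or the top of the test body) are assumptions, not from the test file:

    # Sketch: reclaim CUDA memory so earlier examples' inputs
    # don't starve later, larger ones.
    import gc
    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.synchronize()  # let pending kernels finish
        torch.cuda.empty_cache()  # return cached blocks to the driver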
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9920882Z 2025-05-07T20:32:01.9921001Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.9921006Z 2025-05-07T20:32:01.9921106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9921329Z self=, 2025-05-07T20:32:01.9921405Z T=128, 2025-05-07T20:32:01.9921479Z D=7168, 2025-05-07T20:32:01.9921568Z scale_ub=None, 2025-05-07T20:32:01.9921650Z contiguous=True, 2025-05-07T20:32:01.9921729Z compiled=True, 2025-05-07T20:32:01.9921804Z ) 2025-05-07T20:32:01.9922014Z self = 2025-05-07T20:32:01.9922174Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9922181Z 2025-05-07T20:32:01.9922253Z @given( 2025-05-07T20:32:01.9922365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9922464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9922574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9922688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9922802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9922873Z ) 2025-05-07T20:32:01.9923111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9923356Z def test_silu_mul_quant( 2025-05-07T20:32:01.9923436Z self, 2025-05-07T20:32:01.9923511Z T: int, 2025-05-07T20:32:01.9923587Z D: int, 2025-05-07T20:32:01.9923683Z scale_ub: Optional[float], 2025-05-07T20:32:01.9923792Z contiguous: bool, 2025-05-07T20:32:01.9923881Z compiled: bool, 2025-05-07T20:32:01.9923977Z ) -> None: 2025-05-07T20:32:01.9924078Z torch.manual_seed(2025) 2025-05-07T20:32:01.9924148Z 2025-05-07T20:32:01.9924311Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9926131Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.9926178Z 2025-05-07T20:32:01.9926294Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.9926427Z =============================== warnings summary =============================== 2025-05-07T20:32:01.9926734Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9927031Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9927329Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.9928201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:01.9928479Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:01.9928484Z 2025-05-07T20:32:01.9928697Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:01.9928866Z ================= 1 failed, 1 deselected, 3 warnings in 16.63s ================= 2025-05-07T20:32:03.7209799Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:03.7891672Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:03.7891916Z 2025-05-07T20:32:05.7911788Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:07.9408013Z ============================= test session starts ============================== 2025-05-07T20:32:07.9408693Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:07.9409216Z cachedir: .pytest_cache 2025-05-07T20:32:07.9409792Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:07.9410516Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:07.9410926Z plugins: hypothesis-6.131.14 2025-05-07T20:32:09.5490024Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:09.6999759Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:09.7000339Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:09.7000620Z 2025-05-07T20:32:12.1285891Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1286967Z self=, 2025-05-07T20:32:12.1287442Z T=1, 2025-05-07T20:32:12.1287810Z D=5120, 2025-05-07T20:32:12.1288195Z scale_ub=None, 2025-05-07T20:32:12.1288627Z contiguous=True, 2025-05-07T20:32:12.1289063Z compiled=True, 2025-05-07T20:32:12.1289472Z ) 2025-05-07T20:32:12.1290113Z self = 2025-05-07T20:32:12.1291077Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.1291615Z 2025-05-07T20:32:12.1291772Z @given( 2025-05-07T20:32:12.1292233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1292856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1293462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1294114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1294912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1295602Z ) 2025-05-07T20:32:12.1296303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1297185Z def test_silu_mul_quant( 2025-05-07T20:32:12.1297573Z self, 2025-05-07T20:32:12.1297802Z T: int, 2025-05-07T20:32:12.1298023Z D: int, 2025-05-07T20:32:12.1298240Z scale_ub: Optional[float], 2025-05-07T20:32:12.1298514Z contiguous: bool, 2025-05-07T20:32:12.1298755Z compiled: bool, 2025-05-07T20:32:12.1298976Z ) -> None: 2025-05-07T20:32:12.1299199Z torch.manual_seed(2025) 2025-05-07T20:32:12.1299443Z 2025-05-07T20:32:12.1299714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1300061Z 2025-05-07T20:32:12.1300262Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1300550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:12.1300868Z x = x_sign * x_clamp 2025-05-07T20:32:12.1301112Z x0 = x[:, :D] 2025-05-07T20:32:12.1301438Z x1 = x[:, D:] 2025-05-07T20:32:12.1301646Z 2025-05-07T20:32:12.1301840Z if contiguous: 2025-05-07T20:32:12.1302081Z x0 = x0.contiguous() 2025-05-07T20:32:12.1302336Z x1 = x1.contiguous() 2025-05-07T20:32:12.1302582Z 2025-05-07T20:32:12.1302782Z if scale_ub is not None: 2025-05-07T20:32:12.1303055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1303399Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1303723Z ) 2025-05-07T20:32:12.1303916Z else: 2025-05-07T20:32:12.1304135Z scale_ub_tensor = None 2025-05-07T20:32:12.1304395Z 2025-05-07T20:32:12.1304633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1304954Z op = silu_mul_quant 2025-05-07T20:32:12.1305210Z if compiled: 2025-05-07T20:32:12.1305461Z op = torch.compile(op) 2025-05-07T20:32:12.1305768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1306055Z 2025-05-07T20:32:12.1306254Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.1306536Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.1306832Z 2025-05-07T20:32:12.1307076Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1307408Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.1307705Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.1308025Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.1308382Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.1308702Z 2025-05-07T20:32:12.1308907Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.1309105Z 2025-05-07T20:32:12.1309206Z moe/activation_test.py:126: 2025-05-07T20:32:12.1309510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1309848Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.1310237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.1311032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.1311798Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.1312348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1313032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1313731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.1314464Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.1315280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.1316077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.1316820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.1317471Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.1318077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.1318595Z fn() 2025-05-07T20:32:12.1319110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.1319697Z self.fn.run( 
2025-05-07T20:32:12.1320162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1320699Z kernel = self.compile( 2025-05-07T20:32:12.1321254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1321967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1322356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1322596Z 2025-05-07T20:32:12.1322803Z self = 2025-05-07T20:32:12.1324042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1325436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cda813a0>} 2025-05-07T20:32:12.1326778Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1327873Z context = 2025-05-07T20:32:12.1328170Z 2025-05-07T20:32:12.1328338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1328868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1329334Z module_map=module_map) 2025-05-07T20:32:12.1329704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1330070Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.1330345Z E ^ 2025-05-07T20:32:12.1330808Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1331269Z 2025-05-07T20:32:12.1331693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1332262Z 2025-05-07T20:32:12.1332387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1332868Z self=, 2025-05-07T20:32:12.1333334Z T=2048, 2025-05-07T20:32:12.1333540Z D=5120, 2025-05-07T20:32:12.1333752Z scale_ub=1200.0, 2025-05-07T20:32:12.1333990Z contiguous=True, 2025-05-07T20:32:12.1334232Z compiled=False, 2025-05-07T20:32:12.1334458Z ) 2025-05-07T20:32:13.0821479Z self = 2025-05-07T20:32:13.0822613Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.0823172Z 2025-05-07T20:32:13.0823354Z @given( 2025-05-07T20:32:13.0823814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0824442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0825463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0826130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0826955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0827524Z ) 2025-05-07T20:32:13.0828042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0828484Z def test_silu_mul_quant( 2025-05-07T20:32:13.0828733Z self, 2025-05-07T20:32:13.0828938Z T: int, 2025-05-07T20:32:13.0829139Z D: int, 2025-05-07T20:32:13.0829361Z scale_ub: Optional[float], 2025-05-07T20:32:13.0829638Z contiguous: bool, 2025-05-07T20:32:13.0829880Z compiled: bool, 2025-05-07T20:32:13.0830123Z ) -> None: 2025-05-07T20:32:13.0830347Z torch.manual_seed(2025) 2025-05-07T20:32:13.0830587Z 2025-05-07T20:32:13.0830873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0831228Z 
2025-05-07T20:32:13.0831426Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0831727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0832166Z x = x_sign * x_clamp 2025-05-07T20:32:13.0832414Z x0 = x[:, :D] 2025-05-07T20:32:13.0832627Z x1 = x[:, D:] 2025-05-07T20:32:13.0832842Z 2025-05-07T20:32:13.0833036Z if contiguous: 2025-05-07T20:32:13.0833271Z x0 = x0.contiguous() 2025-05-07T20:32:13.0833538Z x1 = x1.contiguous() 2025-05-07T20:32:13.0833785Z 2025-05-07T20:32:13.0833975Z if scale_ub is not None: 2025-05-07T20:32:13.0834256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0834596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0834900Z ) 2025-05-07T20:32:13.0835096Z else: 2025-05-07T20:32:13.0835313Z scale_ub_tensor = None 2025-05-07T20:32:13.0835560Z 2025-05-07T20:32:13.0835799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0836120Z op = silu_mul_quant 2025-05-07T20:32:13.0836368Z if compiled: 2025-05-07T20:32:13.0836627Z op = torch.compile(op) 2025-05-07T20:32:13.0836927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0837201Z 2025-05-07T20:32:13.0837400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0837574Z 2025-05-07T20:32:13.0837682Z moe/activation_test.py:117: 2025-05-07T20:32:13.0837988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0838324Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0838775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0839476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0840168Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0840712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0841506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0842192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0842729Z kernel = self.compile( 2025-05-07T20:32:13.0843380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0844049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0844445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0844687Z 2025-05-07T20:32:13.0844898Z self = 2025-05-07T20:32:13.0846049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0847488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cd7382c0>} 2025-05-07T20:32:13.0848842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0849866Z context = 2025-05-07T20:32:13.0850164Z 2025-05-07T20:32:13.0850335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0850874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0851351Z module_map=module_map) 2025-05-07T20:32:13.0851718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0852083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0852354Z E ^ 2025-05-07T20:32:13.0852915Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0853376Z 2025-05-07T20:32:13.0861476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0862142Z 2025-05-07T20:32:13.0862258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0862679Z self=, 2025-05-07T20:32:13.0863085Z T=2048, 2025-05-07T20:32:13.0863280Z D=5120, 2025-05-07T20:32:13.0863481Z scale_ub=1200.0, 2025-05-07T20:32:13.0863703Z contiguous=True, 2025-05-07T20:32:13.0863932Z compiled=True, 2025-05-07T20:32:13.0864148Z ) 2025-05-07T20:32:13.0864467Z self = 2025-05-07T20:32:13.0864976Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:13.0865268Z 2025-05-07T20:32:13.0865347Z @given( 2025-05-07T20:32:13.0865584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0865898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0866210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0866546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0866869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0867165Z ) 2025-05-07T20:32:13.0867518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0867955Z def test_silu_mul_quant( 2025-05-07T20:32:13.0868203Z self, 2025-05-07T20:32:13.0868401Z T: int, 2025-05-07T20:32:13.0868590Z D: int, 2025-05-07T20:32:13.0868812Z scale_ub: Optional[float], 2025-05-07T20:32:13.0869088Z contiguous: bool, 2025-05-07T20:32:13.0869335Z compiled: bool, 2025-05-07T20:32:13.0869551Z ) -> None: 2025-05-07T20:32:13.0869855Z torch.manual_seed(2025) 2025-05-07T20:32:13.0870104Z 2025-05-07T20:32:13.0870377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0870723Z 2025-05-07T20:32:13.0870923Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0871212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0871531Z x = x_sign * x_clamp 2025-05-07T20:32:13.0871772Z x0 = x[:, :D] 2025-05-07T20:32:13.0871981Z x1 = x[:, D:] 2025-05-07T20:32:13.0872192Z 2025-05-07T20:32:13.0872380Z if contiguous: 2025-05-07T20:32:13.0872608Z x0 = x0.contiguous() 2025-05-07T20:32:13.0872872Z x1 = x1.contiguous() 2025-05-07T20:32:13.0873121Z 2025-05-07T20:32:13.0873310Z if scale_ub is not None: 2025-05-07T20:32:13.0873586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0873969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0874329Z ) 2025-05-07T20:32:13.0874524Z else: 2025-05-07T20:32:13.0874736Z scale_ub_tensor = None 2025-05-07T20:32:13.0874991Z 2025-05-07T20:32:13.0875218Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0875537Z op = silu_mul_quant 2025-05-07T20:32:13.0875793Z if compiled: 
2025-05-07T20:32:13.0876039Z op = torch.compile(op) 2025-05-07T20:32:13.0876345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0876629Z 2025-05-07T20:32:13.0876817Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.0877108Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.0877409Z 2025-05-07T20:32:13.0877641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0877987Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.0878285Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.0878596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.0879014Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.0879332Z 2025-05-07T20:32:13.0879537Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:13.0879731Z 2025-05-07T20:32:13.0879831Z moe/activation_test.py:126: 2025-05-07T20:32:13.0880134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0880474Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.0880799Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.0881599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.0882362Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.0882918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0883723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0884426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.0885163Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.0885930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:13.0886680Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.0887416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.0888057Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.0888708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.0889232Z fn() 2025-05-07T20:32:13.0889795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.0890377Z self.fn.run( 2025-05-07T20:32:13.0890851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0891383Z kernel = self.compile( 2025-05-07T20:32:13.0891922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0892579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0892976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0893202Z 2025-05-07T20:32:13.0893415Z self = 2025-05-07T20:32:13.0894529Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:13.0895935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cd739440>} 2025-05-07T20:32:13.0897269Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0898292Z context = 2025-05-07T20:32:13.0898577Z 2025-05-07T20:32:13.0898750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0899268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0899738Z module_map=module_map) 2025-05-07T20:32:13.0900108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0900504Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.0900773Z E ^ 2025-05-07T20:32:13.0901241Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0901691Z 2025-05-07T20:32:13.0902121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0902631Z 2025-05-07T20:32:13.0902735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0903147Z self=, 2025-05-07T20:32:13.0903550Z T=16384, 2025-05-07T20:32:13.0903742Z D=7168, 2025-05-07T20:32:13.0903938Z scale_ub=1200.0, 2025-05-07T20:32:13.0904162Z contiguous=False, 2025-05-07T20:32:13.0904379Z compiled=False, 2025-05-07T20:32:13.0904586Z ) 2025-05-07T20:32:13.8995606Z self = 2025-05-07T20:32:13.8996179Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.8996468Z 2025-05-07T20:32:13.8996557Z @given( 2025-05-07T20:32:13.8996790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8997107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8997418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8997752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8998119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8998406Z ) 2025-05-07T20:32:13.8998759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8999193Z def test_silu_mul_quant( 2025-05-07T20:32:13.8999445Z self, 2025-05-07T20:32:13.8999643Z T: int, 2025-05-07T20:32:13.8999839Z D: int, 2025-05-07T20:32:13.9000068Z scale_ub: Optional[float], 2025-05-07T20:32:13.9000539Z contiguous: bool, 2025-05-07T20:32:13.9000775Z compiled: bool, 2025-05-07T20:32:13.9000996Z ) -> None: 2025-05-07T20:32:13.9001213Z torch.manual_seed(2025) 2025-05-07T20:32:13.9001450Z 2025-05-07T20:32:13.9001727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9002073Z 2025-05-07T20:32:13.9002269Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9002558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9002866Z x = x_sign * x_clamp 2025-05-07T20:32:13.9003108Z x0 = x[:, :D] 2025-05-07T20:32:13.9003443Z x1 = x[:, D:] 2025-05-07T20:32:13.9003657Z 2025-05-07T20:32:13.9003839Z if contiguous: 2025-05-07T20:32:13.9004064Z x0 = x0.contiguous() 2025-05-07T20:32:13.9004315Z x1 = x1.contiguous() 2025-05-07T20:32:13.9004547Z 2025-05-07T20:32:13.9004802Z if scale_ub is not None: 2025-05-07T20:32:13.9005073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9005467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9005771Z ) 2025-05-07T20:32:13.9005964Z else: 2025-05-07T20:32:13.9006176Z scale_ub_tensor = None 2025-05-07T20:32:13.9006427Z 2025-05-07T20:32:13.9006664Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:13.9006980Z op = silu_mul_quant 2025-05-07T20:32:13.9007224Z if compiled: 2025-05-07T20:32:13.9007476Z op = torch.compile(op) 2025-05-07T20:32:13.9007775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9008051Z 2025-05-07T20:32:13.9008240Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9008412Z 2025-05-07T20:32:13.9008515Z moe/activation_test.py:117: 2025-05-07T20:32:13.9008811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9009140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9009426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9010191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9010885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9011418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9012107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9012774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9013306Z kernel = self.compile( 2025-05-07T20:32:13.9013856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9014521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9014926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9015159Z 2025-05-07T20:32:13.9015367Z self = 2025-05-07T20:32:13.9016444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9017806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc82e660>} 2025-05-07T20:32:13.9019140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9020161Z context = 2025-05-07T20:32:13.9020460Z 2025-05-07T20:32:13.9020673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9021204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9021676Z module_map=module_map) 2025-05-07T20:32:13.9022044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9022403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9022671Z E ^ 2025-05-07T20:32:13.9023138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9023596Z 2025-05-07T20:32:13.9024016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9024535Z 2025-05-07T20:32:13.9024641Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9025101Z self=, 2025-05-07T20:32:13.9025588Z T=1, 2025-05-07T20:32:13.9025769Z D=7168, 2025-05-07T20:32:13.9025971Z scale_ub=None, 2025-05-07T20:32:13.9026189Z contiguous=True, 2025-05-07T20:32:13.9026412Z compiled=True, 2025-05-07T20:32:13.9026624Z ) 2025-05-07T20:32:13.9026945Z self = 2025-05-07T20:32:13.9027425Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.9027696Z 2025-05-07T20:32:13.9027773Z @given( 2025-05-07T20:32:13.9028009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9028319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9028625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9028959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9029289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9029576Z ) 2025-05-07T20:32:13.9029932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9030429Z def test_silu_mul_quant( 2025-05-07T20:32:13.9030666Z self, 2025-05-07T20:32:13.9030864Z T: int, 2025-05-07T20:32:13.9031064Z D: int, 2025-05-07T20:32:13.9031274Z scale_ub: Optional[float], 2025-05-07T20:32:13.9031548Z contiguous: bool, 2025-05-07T20:32:13.9031788Z compiled: bool, 2025-05-07T20:32:13.9032007Z ) -> None: 2025-05-07T20:32:13.9032227Z torch.manual_seed(2025) 2025-05-07T20:32:13.9032468Z 2025-05-07T20:32:13.9032735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9033079Z 2025-05-07T20:32:13.9033274Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9033561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9033872Z x = x_sign * x_clamp 2025-05-07T20:32:13.9034115Z x0 = x[:, :D] 2025-05-07T20:32:13.9034332Z x1 = x[:, D:] 2025-05-07T20:32:13.9034550Z 2025-05-07T20:32:13.9034749Z if contiguous: 2025-05-07T20:32:13.9034989Z x0 = x0.contiguous() 2025-05-07T20:32:13.9035242Z x1 = x1.contiguous() 2025-05-07T20:32:13.9035487Z 2025-05-07T20:32:13.9035678Z if scale_ub is not None: 2025-05-07T20:32:13.9035948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9036289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9036604Z ) 2025-05-07T20:32:13.9036793Z else: 2025-05-07T20:32:13.9037008Z scale_ub_tensor = None 2025-05-07T20:32:13.9037261Z 2025-05-07T20:32:13.9037494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9037813Z op = silu_mul_quant 2025-05-07T20:32:13.9038067Z if compiled: 2025-05-07T20:32:13.9038333Z op = torch.compile(op) 2025-05-07T20:32:13.9038810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9039090Z 2025-05-07T20:32:13.9039371Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.9039657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.9039952Z 2025-05-07T20:32:13.9040195Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9040526Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.9040820Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.9041135Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.9041499Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.9041807Z 2025-05-07T20:32:13.9042018Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:13.9042212Z 2025-05-07T20:32:13.9042321Z moe/activation_test.py:126: 2025-05-07T20:32:13.9042612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9042948Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.9043473Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.9044311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.9045070Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.9045623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9046311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9046997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.9047724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.9048483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:13.9049242Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.9050045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.9050690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.9051292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.9051810Z fn() 2025-05-07T20:32:13.9052322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.9052908Z self.fn.run( 2025-05-07T20:32:13.9053382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9054083Z kernel = self.compile( 2025-05-07T20:32:13.9054635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9055298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9055695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9055930Z 2025-05-07T20:32:13.9056139Z self = 2025-05-07T20:32:13.9057560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9059144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cc5ea5c0>} 2025-05-07T20:32:13.9060485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9061568Z context = 2025-05-07T20:32:13.9061865Z 2025-05-07T20:32:13.9062032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9062557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9063029Z module_map=module_map) 2025-05-07T20:32:13.9063392Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9063751Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.9064025Z E ^ 2025-05-07T20:32:13.9064488Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9064944Z 2025-05-07T20:32:13.9065365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9065931Z 2025-05-07T20:32:13.9066039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9066508Z self=, 2025-05-07T20:32:13.9066909Z T=4096, 2025-05-07T20:32:13.9067102Z D=5120, 2025-05-07T20:32:13.9067300Z scale_ub=None, 2025-05-07T20:32:13.9067511Z contiguous=False, 2025-05-07T20:32:13.9067740Z compiled=False, 2025-05-07T20:32:13.9067955Z ) 2025-05-07T20:32:14.8310405Z self = 2025-05-07T20:32:14.8310958Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.8311254Z 2025-05-07T20:32:14.8311340Z @given( 2025-05-07T20:32:14.8311583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8311898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8312212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8312554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8312882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8313319Z ) 2025-05-07T20:32:14.8313675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8314121Z def test_silu_mul_quant( 2025-05-07T20:32:14.8314372Z self, 2025-05-07T20:32:14.8314578Z T: int, 2025-05-07T20:32:14.8314770Z D: int, 2025-05-07T20:32:14.8314997Z scale_ub: Optional[float], 2025-05-07T20:32:14.8315277Z contiguous: bool, 2025-05-07T20:32:14.8315514Z compiled: bool, 2025-05-07T20:32:14.8315748Z ) -> None: 2025-05-07T20:32:14.8315965Z torch.manual_seed(2025) 2025-05-07T20:32:14.8316213Z 2025-05-07T20:32:14.8316480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8316827Z 2025-05-07T20:32:14.8317020Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8317309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8317626Z x = x_sign * x_clamp 2025-05-07T20:32:14.8317872Z x0 = x[:, :D] 2025-05-07T20:32:14.8318104Z x1 = x[:, D:] 2025-05-07T20:32:14.8318349Z 2025-05-07T20:32:14.8318537Z if contiguous: 2025-05-07T20:32:14.8318767Z x0 = x0.contiguous() 2025-05-07T20:32:14.8319031Z x1 = x1.contiguous() 2025-05-07T20:32:14.8319277Z 2025-05-07T20:32:14.8319465Z if scale_ub is not None: 2025-05-07T20:32:14.8319740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8320080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8320388Z ) 2025-05-07T20:32:14.8320585Z else: 2025-05-07T20:32:14.8320800Z scale_ub_tensor = None 2025-05-07T20:32:14.8321055Z 2025-05-07T20:32:14.8321285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8321603Z op = silu_mul_quant 2025-05-07T20:32:14.8321859Z if compiled: 
2025-05-07T20:32:14.8322109Z op = torch.compile(op) 2025-05-07T20:32:14.8322487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8322774Z 2025-05-07T20:32:14.8322962Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8323132Z 2025-05-07T20:32:14.8323314Z moe/activation_test.py:117: 2025-05-07T20:32:14.8323614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8323944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8324234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8324930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.8325630Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.8326168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8326931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8327620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8328219Z kernel = self.compile( 2025-05-07T20:32:14.8328771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8329442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8329846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8330076Z 2025-05-07T20:32:14.8330287Z self = 2025-05-07T20:32:14.8331375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8332765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc068540>} 2025-05-07T20:32:14.8334157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8335197Z context = 2025-05-07T20:32:14.8335485Z 2025-05-07T20:32:14.8335657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8336187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8336663Z module_map=module_map) 2025-05-07T20:32:14.8337029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8337388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.8337649Z E ^ 2025-05-07T20:32:14.8338125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8338827Z 2025-05-07T20:32:14.8339251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8339774Z 2025-05-07T20:32:14.8339879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8340298Z self=, 2025-05-07T20:32:14.8340706Z T=4096, 2025-05-07T20:32:14.8340892Z D=7168, 2025-05-07T20:32:14.8341093Z scale_ub=None, 2025-05-07T20:32:14.8341307Z contiguous=False, 2025-05-07T20:32:14.8348249Z compiled=False, 2025-05-07T20:32:14.8348541Z ) 2025-05-07T20:32:14.8348871Z self = 2025-05-07T20:32:14.8349387Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.8349661Z 2025-05-07T20:32:14.8349748Z @given( 2025-05-07T20:32:14.8350110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8350440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8350740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8351079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8351409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8351717Z ) 2025-05-07T20:32:14.8352062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8352510Z def test_silu_mul_quant( 2025-05-07T20:32:14.8352754Z self, 2025-05-07T20:32:14.8352952Z T: int, 2025-05-07T20:32:14.8353147Z D: int, 2025-05-07T20:32:14.8353370Z scale_ub: Optional[float], 2025-05-07T20:32:14.8353643Z contiguous: bool, 2025-05-07T20:32:14.8353875Z compiled: bool, 2025-05-07T20:32:14.8354103Z ) -> None: 2025-05-07T20:32:14.8354388Z torch.manual_seed(2025) 2025-05-07T20:32:14.8354629Z 2025-05-07T20:32:14.8354964Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8355314Z 2025-05-07T20:32:14.8355505Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8355799Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8356109Z x = x_sign * x_clamp 2025-05-07T20:32:14.8356344Z x0 = x[:, :D] 2025-05-07T20:32:14.8356563Z x1 = x[:, D:] 2025-05-07T20:32:14.8356777Z 2025-05-07T20:32:14.8356964Z if contiguous: 2025-05-07T20:32:14.8357202Z x0 = x0.contiguous() 2025-05-07T20:32:14.8357467Z x1 = x1.contiguous() 2025-05-07T20:32:14.8357702Z 2025-05-07T20:32:14.8357901Z if scale_ub is not None: 2025-05-07T20:32:14.8358184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8358525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8358832Z ) 2025-05-07T20:32:14.8359035Z else: 2025-05-07T20:32:14.8359255Z scale_ub_tensor = None 2025-05-07T20:32:14.8359579Z 2025-05-07T20:32:14.8359816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8360137Z op = silu_mul_quant 2025-05-07T20:32:14.8360379Z if compiled: 2025-05-07T20:32:14.8360631Z op = torch.compile(op) 2025-05-07T20:32:14.8360934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8361204Z 2025-05-07T20:32:14.8361396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8361559Z 2025-05-07T20:32:14.8361667Z moe/activation_test.py:117: 2025-05-07T20:32:14.8361957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8362297Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8362586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8363394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.8364091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.8364641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8365331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8366003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8366536Z kernel = self.compile( 2025-05-07T20:32:14.8367090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8367753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8368151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8368396Z 2025-05-07T20:32:14.8368643Z self = 2025-05-07T20:32:14.8369782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8371158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cc5cc720>} 2025-05-07T20:32:14.8372499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8373514Z context = 2025-05-07T20:32:14.8373807Z 2025-05-07T20:32:14.8373975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8374537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8375051Z module_map=module_map) 2025-05-07T20:32:14.8375413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8375770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.8376039Z E ^ 2025-05-07T20:32:14.8376501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8376959Z 2025-05-07T20:32:14.8377381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8377907Z 2025-05-07T20:32:14.8378012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8378435Z self=, 2025-05-07T20:32:14.8378835Z T=128, 2025-05-07T20:32:14.8379023Z D=7168, 2025-05-07T20:32:14.8379209Z scale_ub=None, 2025-05-07T20:32:14.8379426Z contiguous=False, 2025-05-07T20:32:14.8379658Z compiled=True, 2025-05-07T20:32:14.8379905Z ) 2025-05-07T20:32:14.8806430Z self = 2025-05-07T20:32:14.8806978Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.8807271Z 2025-05-07T20:32:14.8807358Z @given( 2025-05-07T20:32:14.8807595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8807908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8808216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8808546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8808871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8809162Z ) 2025-05-07T20:32:14.8809514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8809951Z def test_silu_mul_quant( 2025-05-07T20:32:14.8810201Z self, 2025-05-07T20:32:14.8810405Z T: int, 2025-05-07T20:32:14.8810600Z D: int, 2025-05-07T20:32:14.8810825Z scale_ub: Optional[float], 2025-05-07T20:32:14.8811100Z contiguous: bool, 2025-05-07T20:32:14.8811335Z compiled: bool, 2025-05-07T20:32:14.8811564Z ) -> None: 2025-05-07T20:32:14.8811787Z torch.manual_seed(2025) 2025-05-07T20:32:14.8812025Z 2025-05-07T20:32:14.8812296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8812641Z 2025-05-07T20:32:14.8812841Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8813132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8813441Z x = x_sign * x_clamp 2025-05-07T20:32:14.8813687Z x0 = x[:, :D] 2025-05-07T20:32:14.8813900Z x1 = x[:, D:] 2025-05-07T20:32:14.8814110Z 2025-05-07T20:32:14.8814302Z if contiguous: 2025-05-07T20:32:14.8814527Z x0 = x0.contiguous() 2025-05-07T20:32:14.8814788Z x1 = x1.contiguous() 2025-05-07T20:32:14.8815038Z 2025-05-07T20:32:14.8815325Z if scale_ub is not None: 2025-05-07T20:32:14.8815610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8815957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8816265Z ) 2025-05-07T20:32:14.8816464Z else: 2025-05-07T20:32:14.8816679Z scale_ub_tensor = None 2025-05-07T20:32:14.8816930Z 2025-05-07T20:32:14.8817169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8817490Z op = silu_mul_quant 2025-05-07T20:32:14.8817741Z if compiled: 2025-05-07T20:32:14.8817989Z op = torch.compile(op) 2025-05-07T20:32:14.8818288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8818571Z 2025-05-07T20:32:14.8818762Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.8819048Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.8819413Z 2025-05-07T20:32:14.8819652Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8820045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.8820339Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.8820649Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.8821012Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.8821325Z 2025-05-07T20:32:14.8821531Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.8821725Z 2025-05-07T20:32:14.8821826Z moe/activation_test.py:126: 2025-05-07T20:32:14.8822123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8822463Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.8822786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.8823585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.8824346Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.8824969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.8825651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.8826344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.8827071Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.8827822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.8828575Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.8829362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.8830010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.8830613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.8831141Z fn() 2025-05-07T20:32:14.8831656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.8832240Z self.fn.run( 2025-05-07T20:32:14.8832707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.8833244Z kernel = self.compile( 2025-05-07T20:32:14.8833793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.8834447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.8834847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8835087Z 2025-05-07T20:32:14.8835343Z self = 2025-05-07T20:32:14.8836435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.8837800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cc0bb060>} 2025-05-07T20:32:14.8839370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.8840402Z context = 2025-05-07T20:32:14.8840690Z 2025-05-07T20:32:14.8840935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.8841543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.8842008Z module_map=module_map) 2025-05-07T20:32:14.8842380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.8842750Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.8843015Z E ^ 2025-05-07T20:32:14.8843579Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.8844033Z 2025-05-07T20:32:14.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.8844986Z 2025-05-07T20:32:14.8845100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.8845507Z self=, 2025-05-07T20:32:14.8845915Z T=128, 2025-05-07T20:32:14.8846108Z D=7168, 2025-05-07T20:32:14.8846305Z scale_ub=None, 2025-05-07T20:32:14.8846596Z contiguous=False, 2025-05-07T20:32:14.8846828Z compiled=False, 2025-05-07T20:32:14.8847028Z ) 2025-05-07T20:32:15.2007450Z self = 2025-05-07T20:32:15.2007991Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.2008286Z 2025-05-07T20:32:15.2008375Z @given( 2025-05-07T20:32:15.2008657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2008977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2009282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2009605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2009932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2010215Z ) 2025-05-07T20:32:15.2010563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2011004Z def test_silu_mul_quant( 2025-05-07T20:32:15.2011255Z self, 2025-05-07T20:32:15.2011447Z T: int, 2025-05-07T20:32:15.2011636Z D: int, 2025-05-07T20:32:15.2011851Z scale_ub: Optional[float], 2025-05-07T20:32:15.2012120Z contiguous: bool, 2025-05-07T20:32:15.2012357Z compiled: bool, 2025-05-07T20:32:15.2012581Z ) -> None: 2025-05-07T20:32:15.2012794Z torch.manual_seed(2025) 2025-05-07T20:32:15.2013032Z 2025-05-07T20:32:15.2013303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2013646Z 2025-05-07T20:32:15.2013832Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2014119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2014428Z x = x_sign * x_clamp 2025-05-07T20:32:15.2014663Z x0 = x[:, :D] 2025-05-07T20:32:15.2014880Z x1 = x[:, D:] 2025-05-07T20:32:15.2015084Z 2025-05-07T20:32:15.2015265Z if contiguous: 2025-05-07T20:32:15.2015617Z x0 = x0.contiguous() 2025-05-07T20:32:15.2015882Z x1 = x1.contiguous() 2025-05-07T20:32:15.2016118Z 2025-05-07T20:32:15.2016308Z if scale_ub is not None: 2025-05-07T20:32:15.2016588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2016924Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2017231Z ) 2025-05-07T20:32:15.2017426Z else: 2025-05-07T20:32:15.2017635Z scale_ub_tensor = None 2025-05-07T20:32:15.2017884Z 2025-05-07T20:32:15.2018119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2018463Z op = silu_mul_quant 2025-05-07T20:32:15.2018731Z if compiled: 
2025-05-07T20:32:15.2018981Z op = torch.compile(op) 2025-05-07T20:32:15.2019277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2019548Z 2025-05-07T20:32:15.2019832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2019997Z 2025-05-07T20:32:15.2020164Z moe/activation_test.py:117: 2025-05-07T20:32:15.2020456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2020795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2021085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2021781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2022468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2023009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2023701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2024369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2024900Z kernel = self.compile( 2025-05-07T20:32:15.2025456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2026197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2026585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2026816Z 2025-05-07T20:32:15.2027028Z self = 2025-05-07T20:32:15.2028104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2029513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a3a88cc0>} 2025-05-07T20:32:15.2030855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2031877Z context = 2025-05-07T20:32:15.2032168Z 2025-05-07T20:32:15.2032336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2032858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2033326Z module_map=module_map) 2025-05-07T20:32:15.2033683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2034037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2034296Z E ^ 2025-05-07T20:32:15.2034755Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2035210Z 2025-05-07T20:32:15.2035680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2036219Z 2025-05-07T20:32:15.2036319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2036732Z self=, 2025-05-07T20:32:15.2037124Z T=4096, 2025-05-07T20:32:15.2037309Z D=5120, 2025-05-07T20:32:15.2037499Z scale_ub=1200.0, 2025-05-07T20:32:15.2037718Z contiguous=True, 2025-05-07T20:32:15.2037938Z compiled=False, 2025-05-07T20:32:15.2038141Z ) 2025-05-07T20:32:15.2038715Z self = 2025-05-07T20:32:15.2039238Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.2039517Z 2025-05-07T20:32:15.2039598Z @given( 2025-05-07T20:32:15.2039832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2040139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2040514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2040901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2041234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2041517Z ) 2025-05-07T20:32:15.2041871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2042311Z def test_silu_mul_quant( 2025-05-07T20:32:15.2042544Z self, 2025-05-07T20:32:15.2042739Z T: int, 2025-05-07T20:32:15.2042936Z D: int, 2025-05-07T20:32:15.2043146Z scale_ub: Optional[float], 2025-05-07T20:32:15.2043471Z contiguous: bool, 2025-05-07T20:32:15.2043714Z compiled: bool, 2025-05-07T20:32:15.2043935Z ) -> None: 2025-05-07T20:32:15.2044150Z torch.manual_seed(2025) 2025-05-07T20:32:15.2044396Z 2025-05-07T20:32:15.2044666Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2045005Z 2025-05-07T20:32:15.2045199Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2045487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2045868Z x = x_sign * x_clamp 2025-05-07T20:32:15.2046103Z x0 = x[:, :D] 2025-05-07T20:32:15.2046322Z x1 = x[:, D:] 2025-05-07T20:32:15.2046522Z 2025-05-07T20:32:15.2046702Z if contiguous: 2025-05-07T20:32:15.2046933Z x0 = x0.contiguous() 2025-05-07T20:32:15.2047188Z x1 = x1.contiguous() 2025-05-07T20:32:15.2047427Z 2025-05-07T20:32:15.2047615Z if scale_ub is not None: 2025-05-07T20:32:15.2047880Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2048211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2048523Z ) 2025-05-07T20:32:15.2048735Z else: 2025-05-07T20:32:15.2048968Z scale_ub_tensor = None 2025-05-07T20:32:15.2049218Z 2025-05-07T20:32:15.2049441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2049755Z op = silu_mul_quant 2025-05-07T20:32:15.2050010Z if compiled: 2025-05-07T20:32:15.2050250Z op = torch.compile(op) 2025-05-07T20:32:15.2050545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2050817Z 2025-05-07T20:32:15.2051006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2051166Z 2025-05-07T20:32:15.2051264Z moe/activation_test.py:117: 2025-05-07T20:32:15.2051553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2051884Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2052156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2052847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2053540Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2054082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2054833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2055506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2056043Z kernel = self.compile( 2025-05-07T20:32:15.2056576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2057237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2057636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2057861Z 2025-05-07T20:32:15.2058071Z self = 2025-05-07T20:32:15.2059234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2060640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a3a89f80>} 2025-05-07T20:32:15.2061980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2063003Z context = 2025-05-07T20:32:15.2063286Z 2025-05-07T20:32:15.2063458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2063971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2064437Z module_map=module_map) 2025-05-07T20:32:15.2064803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2065149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2065455Z E ^ 2025-05-07T20:32:15.2065922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2066371Z 2025-05-07T20:32:15.2066790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2067303Z 2025-05-07T20:32:15.2067406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2067816Z self=, 2025-05-07T20:32:15.2068216Z T=1, 2025-05-07T20:32:15.2068392Z D=5120, 2025-05-07T20:32:15.2068582Z scale_ub=None, 2025-05-07T20:32:15.2068794Z contiguous=True, 2025-05-07T20:32:15.2069006Z compiled=True, 2025-05-07T20:32:15.2069205Z ) 2025-05-07T20:32:15.6548453Z self = 2025-05-07T20:32:15.6549040Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.6549316Z 2025-05-07T20:32:15.6549399Z @given( 2025-05-07T20:32:15.6549641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6549961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6550277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6550609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6550944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6551237Z ) 2025-05-07T20:32:15.6551588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6552035Z def test_silu_mul_quant( 2025-05-07T20:32:15.6552283Z self, 2025-05-07T20:32:15.6552475Z T: int, 2025-05-07T20:32:15.6552677Z D: int, 2025-05-07T20:32:15.6552905Z scale_ub: Optional[float], 2025-05-07T20:32:15.6553177Z contiguous: bool, 2025-05-07T20:32:15.6553427Z compiled: bool, 2025-05-07T20:32:15.6553809Z ) -> None: 2025-05-07T20:32:15.6554029Z torch.manual_seed(2025) 2025-05-07T20:32:15.6554273Z 2025-05-07T20:32:15.6554550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6554899Z 2025-05-07T20:32:15.6561593Z x_sign = torch.sign(x) 2025-05-07T20:32:15.6561914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.6562231Z x = x_sign * x_clamp 2025-05-07T20:32:15.6562484Z x0 = x[:, :D] 2025-05-07T20:32:15.6562710Z x1 = x[:, D:] 2025-05-07T20:32:15.6562918Z 2025-05-07T20:32:15.6563117Z if contiguous: 2025-05-07T20:32:15.6563478Z x0 = x0.contiguous() 2025-05-07T20:32:15.6563737Z x1 = x1.contiguous() 2025-05-07T20:32:15.6563985Z 2025-05-07T20:32:15.6564184Z if scale_ub is not None: 2025-05-07T20:32:15.6564455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.6564914Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.6565299Z ) 2025-05-07T20:32:15.6565495Z else: 2025-05-07T20:32:15.6565704Z scale_ub_tensor = None 2025-05-07T20:32:15.6565965Z 2025-05-07T20:32:15.6566207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6566520Z op = silu_mul_quant 2025-05-07T20:32:15.6566782Z if compiled: 2025-05-07T20:32:15.6567035Z op = torch.compile(op) 2025-05-07T20:32:15.6567327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6567610Z 2025-05-07T20:32:15.6567807Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.6568090Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.6568385Z 2025-05-07T20:32:15.6568641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6569021Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.6569326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.6569645Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.6570087Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.6570398Z 2025-05-07T20:32:15.6570605Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:15.6570800Z 2025-05-07T20:32:15.6570915Z moe/activation_test.py:126: 2025-05-07T20:32:15.6571208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6571549Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.6571879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.6572670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.6573434Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.6573988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.6574683Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.6575380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.6576107Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.6576867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.6577623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.6578375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.6579045Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.6579657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.6580180Z fn() 2025-05-07T20:32:15.6580738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.6581338Z self.fn.run( 2025-05-07T20:32:15.6581810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.6582337Z kernel = self.compile( 2025-05-07T20:32:15.6582886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.6583548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.6583951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6584179Z 2025-05-07T20:32:15.6584387Z self = 2025-05-07T20:32:15.6585519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.6586938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18a3a8afc0>} 2025-05-07T20:32:15.6588285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.6589357Z context = 2025-05-07T20:32:15.6589654Z 2025-05-07T20:32:15.6589822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.6590355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.6590832Z module_map=module_map) 2025-05-07T20:32:15.6591199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.6591605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.6591879Z E ^ 2025-05-07T20:32:15.6592342Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.6592800Z 2025-05-07T20:32:15.6593223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.6593746Z 2025-05-07T20:32:15.6593850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6594270Z self=, 2025-05-07T20:32:15.6594669Z T=2048, 2025-05-07T20:32:15.6594873Z D=5120, 2025-05-07T20:32:15.6595071Z scale_ub=None, 2025-05-07T20:32:15.6595281Z contiguous=True, 2025-05-07T20:32:15.6595507Z compiled=True, 2025-05-07T20:32:15.6595705Z ) 2025-05-07T20:32:16.0931123Z self = 2025-05-07T20:32:16.0931662Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.0931933Z 2025-05-07T20:32:16.0932021Z @given( 2025-05-07T20:32:16.0932262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0932602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0932916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0933253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0933579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0933877Z ) 2025-05-07T20:32:16.0934237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0934679Z def test_silu_mul_quant( 2025-05-07T20:32:16.0934928Z self, 2025-05-07T20:32:16.0935128Z T: int, 2025-05-07T20:32:16.0935324Z D: int, 2025-05-07T20:32:16.0935549Z scale_ub: Optional[float], 2025-05-07T20:32:16.0935945Z contiguous: bool, 2025-05-07T20:32:16.0936193Z compiled: bool, 2025-05-07T20:32:16.0936418Z ) -> None: 2025-05-07T20:32:16.0936638Z torch.manual_seed(2025) 2025-05-07T20:32:16.0936883Z 2025-05-07T20:32:16.0937154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0937503Z 2025-05-07T20:32:16.0937704Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0937993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0938305Z x = x_sign * x_clamp 2025-05-07T20:32:16.0938760Z x0 = x[:, :D] 2025-05-07T20:32:16.0938975Z x1 = x[:, D:] 2025-05-07T20:32:16.0939190Z 2025-05-07T20:32:16.0939384Z if contiguous: 2025-05-07T20:32:16.0939618Z x0 = x0.contiguous() 2025-05-07T20:32:16.0939885Z x1 = x1.contiguous() 2025-05-07T20:32:16.0940132Z 2025-05-07T20:32:16.0940394Z if scale_ub is not None: 2025-05-07T20:32:16.0940682Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0941086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0941403Z ) 2025-05-07T20:32:16.0941597Z else: 2025-05-07T20:32:16.0941810Z scale_ub_tensor = None 2025-05-07T20:32:16.0942071Z 2025-05-07T20:32:16.0943731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0944051Z op = silu_mul_quant 2025-05-07T20:32:16.0944304Z if compiled: 
2025-05-07T20:32:16.0944550Z op = torch.compile(op) 2025-05-07T20:32:16.0944852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0945137Z 2025-05-07T20:32:16.0945328Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.0945621Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.0945920Z 2025-05-07T20:32:16.0946157Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0946503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.0946804Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.0947199Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.0947559Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.0947874Z 2025-05-07T20:32:16.0948083Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.0948276Z 2025-05-07T20:32:16.0948378Z moe/activation_test.py:126: 2025-05-07T20:32:16.0948681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0949020Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.0949344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.0950137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.0950897Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.0951454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0952143Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0952840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.0953570Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.0954329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.0955077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.0955813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.0956455Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.0957128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.0957672Z fn() 2025-05-07T20:32:16.0958188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.0958780Z self.fn.run( 2025-05-07T20:32:16.0959300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0959839Z kernel = self.compile( 2025-05-07T20:32:16.0960389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0961044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0961443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0961679Z 2025-05-07T20:32:16.0961935Z self = 2025-05-07T20:32:16.0963019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:16.0964543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a37bf420>} 2025-05-07T20:32:16.0965875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0966907Z context = 2025-05-07T20:32:16.0967198Z 2025-05-07T20:32:16.0967368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0967898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0968511Z module_map=module_map) 2025-05-07T20:32:16.0968880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0969240Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.0969509Z E ^ 2025-05-07T20:32:16.0969977Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.0970437Z 2025-05-07T20:32:16.0970859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.0971380Z 2025-05-07T20:32:16.0971493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0971905Z self=, 2025-05-07T20:32:16.0972314Z T=128, 2025-05-07T20:32:16.0972508Z D=5120, 2025-05-07T20:32:16.0972700Z scale_ub=None, 2025-05-07T20:32:16.0972924Z contiguous=True, 2025-05-07T20:32:16.0973159Z compiled=True, 2025-05-07T20:32:16.0973367Z ) 2025-05-07T20:32:16.7668129Z self = 2025-05-07T20:32:16.7669083Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.7669370Z 2025-05-07T20:32:16.7669468Z @given( 2025-05-07T20:32:16.7669717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7670047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7670365Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7670710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7671053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7671350Z ) 2025-05-07T20:32:16.7671703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7672160Z def test_silu_mul_quant( 2025-05-07T20:32:16.7672426Z self, 2025-05-07T20:32:16.7672628Z T: int, 2025-05-07T20:32:16.7672968Z D: int, 2025-05-07T20:32:16.7673210Z scale_ub: Optional[float], 2025-05-07T20:32:16.7673494Z contiguous: bool, 2025-05-07T20:32:16.7673746Z compiled: bool, 2025-05-07T20:32:16.7673978Z ) -> None: 2025-05-07T20:32:16.7674199Z torch.manual_seed(2025) 2025-05-07T20:32:16.7674448Z 2025-05-07T20:32:16.7674728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7675068Z 2025-05-07T20:32:16.7675264Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7675564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7675881Z x = x_sign * x_clamp 2025-05-07T20:32:16.7676115Z x0 = x[:, :D] 2025-05-07T20:32:16.7676336Z x1 = x[:, D:] 2025-05-07T20:32:16.7676551Z 2025-05-07T20:32:16.7676738Z if contiguous: 2025-05-07T20:32:16.7676973Z x0 = x0.contiguous() 2025-05-07T20:32:16.7677301Z x1 = x1.contiguous() 2025-05-07T20:32:16.7677592Z 2025-05-07T20:32:16.7677802Z if scale_ub is not None: 2025-05-07T20:32:16.7678078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7678413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7678725Z ) 2025-05-07T20:32:16.7678950Z else: 2025-05-07T20:32:16.7679185Z scale_ub_tensor = None 2025-05-07T20:32:16.7679445Z 2025-05-07T20:32:16.7679684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:16.7679992Z op = silu_mul_quant 2025-05-07T20:32:16.7680244Z if compiled: 2025-05-07T20:32:16.7680497Z op = torch.compile(op) 2025-05-07T20:32:16.7680792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7681079Z 2025-05-07T20:32:16.7681284Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.7681577Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.7681872Z 2025-05-07T20:32:16.7682119Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7682533Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.7682829Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.7683151Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.7683595Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.7683907Z 2025-05-07T20:32:16.7684119Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.7684313Z 2025-05-07T20:32:16.7684423Z moe/activation_test.py:126: 2025-05-07T20:32:16.7684730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7685071Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.7685402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.7686204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.7686965Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.7687529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7688226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7688977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.7689702Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.7690462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.7691223Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.7691968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.7692654Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.7693275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.7693807Z fn() 2025-05-07T20:32:16.7694318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.7694911Z self.fn.run( 2025-05-07T20:32:16.7695388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7695924Z kernel = self.compile( 2025-05-07T20:32:16.7696468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7697130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7697573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7697804Z 2025-05-07T20:32:16.7698058Z self = 2025-05-07T20:32:16.7699197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.7700566Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18a3045e40>}
2025-05-07T20:32:16.7701911Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:16.7702946Z context = <...>
2025-05-07T20:32:16.7703232Z 
2025-05-07T20:32:16.7703406Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.7703942Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.7704455Z                            module_map=module_map)
2025-05-07T20:32:16.7704824Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.7705183Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:16.7705458Z E       ^
2025-05-07T20:32:16.7705929Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.7706386Z 
2025-05-07T20:32:16.7706806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.7707331Z 
2025-05-07T20:32:16.7707439Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.7707855Z     self=<...>,
2025-05-07T20:32:16.7708266Z     T=4096,
2025-05-07T20:32:16.7708455Z     D=5120,
2025-05-07T20:32:16.7708656Z     scale_ub=None,
2025-05-07T20:32:16.7708880Z     contiguous=True,
2025-05-07T20:32:16.7709103Z     compiled=True,
2025-05-07T20:32:16.7709317Z )
2025-05-07T20:32:17.2825802Z self = <...>
2025-05-07T20:32:17.2826858Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:17.2827406Z 
2025-05-07T20:32:17.2827559Z     @given(
2025-05-07T20:32:17.2828017Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.2828638Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.2829247Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.2829633Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.2829956Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.2830242Z     )
2025-05-07T20:32:17.2830598Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.2831149Z     def test_silu_mul_quant(
2025-05-07T20:32:17.2831411Z         self,
2025-05-07T20:32:17.2831614Z         T: int,
2025-05-07T20:32:17.2831809Z         D: int,
2025-05-07T20:32:17.2832032Z         scale_ub: Optional[float],
2025-05-07T20:32:17.2832312Z         contiguous: bool,
2025-05-07T20:32:17.2832551Z         compiled: bool,
2025-05-07T20:32:17.2832779Z     ) -> None:
2025-05-07T20:32:17.2833000Z         torch.manual_seed(2025)
2025-05-07T20:32:17.2833245Z 
2025-05-07T20:32:17.2833516Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.2833868Z 
2025-05-07T20:32:17.2834067Z         x_sign = torch.sign(x)
2025-05-07T20:32:17.2834353Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.2834666Z         x = x_sign * x_clamp
2025-05-07T20:32:17.2834910Z         x0 = x[:, :D]
2025-05-07T20:32:17.2835122Z         x1 = x[:, D:]
2025-05-07T20:32:17.2835338Z 
2025-05-07T20:32:17.2835594Z         if contiguous:
2025-05-07T20:32:17.2835828Z             x0 = x0.contiguous()
2025-05-07T20:32:17.2836156Z             x1 = x1.contiguous()
2025-05-07T20:32:17.2836401Z 
2025-05-07T20:32:17.2836589Z         if scale_ub is not None:
2025-05-07T20:32:17.2836861Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.2837201Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.2837508Z             )
2025-05-07T20:32:17.2837703Z         else:
2025-05-07T20:32:17.2837916Z             scale_ub_tensor = None
2025-05-07T20:32:17.2838170Z 
2025-05-07T20:32:17.2838566Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.2838885Z             op = silu_mul_quant
2025-05-07T20:32:17.2839143Z             if compiled:
2025-05-07T20:32:17.2839388Z                 op = torch.compile(op)
2025-05-07T20:32:17.2839687Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.2839966Z 
2025-05-07T20:32:17.2840160Z         y_fp8, y_scale = fn()
2025-05-07T20:32:17.2840451Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:17.2840828Z 
2025-05-07T20:32:17.2841065Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.2841403Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:17.2841705Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:17.2842019Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:17.2842383Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.2842698Z 
2025-05-07T20:32:17.2842906Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.2843103Z 
2025-05-07T20:32:17.2843282Z moe/activation_test.py:126: 
2025-05-07T20:32:17.2843581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.2843915Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.2844238Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.2845036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.2845801Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.2846358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.2847040Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.2847738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.2848474Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:17.2849233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:17.2849983Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:17.2850818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.2851470Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:17.2852067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:17.2852589Z     fn()
2025-05-07T20:32:17.2853100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:17.2853682Z     self.fn.run(
2025-05-07T20:32:17.2854150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.2854683Z     kernel = self.compile(
2025-05-07T20:32:17.2855228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.2855973Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.2856378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.2856672Z 
2025-05-07T20:32:17.2856882Z self = <...>
2025-05-07T20:32:17.2857957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.2859329Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18a3352840>}
2025-05-07T20:32:17.2860664Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:17.2861694Z context = <...>
2025-05-07T20:32:17.2861984Z 
2025-05-07T20:32:17.2862206Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.2862727Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.2863190Z                            module_map=module_map)
2025-05-07T20:32:17.2863559Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.2863921Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.2864184Z E       ^
2025-05-07T20:32:17.2864651Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.2865104Z 
2025-05-07T20:32:17.2871815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.2872538Z 
2025-05-07T20:32:17.2872712Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.2873177Z     self=<...>,
2025-05-07T20:32:17.2873588Z     T=16384,
2025-05-07T20:32:17.2873789Z     D=5120,
2025-05-07T20:32:17.2873986Z     scale_ub=None,
2025-05-07T20:32:17.2874198Z     contiguous=True,
2025-05-07T20:32:17.2874420Z     compiled=True,
2025-05-07T20:32:17.2874627Z )
2025-05-07T20:32:17.3126653Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:17.3127900Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:17.3129239Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:17.3130348Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:17.3131461Z W0507 20:32:17.311000 87828 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
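The recompile_limit warning above is a side effect of the parameter sweep, not of the fp8 failure itself: the test wraps silu_mul_quant in torch.compile, Dynamo guards on the strides of x0, and alternating contiguous and sliced inputs (row stride 5120 vs. 10240) force a fresh compile per combination until the limit of 8 is reached, after which Dynamo falls back to eager. A minimal sketch of the two knobs involved, assuming the torch._dynamo.config.recompile_limit name exactly as printed in the warning; silu_mul here is a stand-in, not fbgemm's op:

    import torch

    # Trades more compile time for fewer eager fallbacks (default 8 per the warning).
    torch._dynamo.config.recompile_limit = 16

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for fbgemm's silu_mul_quant (the real op also quantizes).
        return x0 * torch.sigmoid(x0) * x1

    # dynamic=True asks Dynamo to symbolize sizes/strides up front, which can
    # avoid one recompile per distinct (T, D, contiguity) combination.
    compiled = torch.compile(silu_mul, dynamic=True)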
2025-05-07T20:32:17.3824295Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:17.3841615Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.3841919Z moe/activation_test.py:126: 
2025-05-07T20:32:17.3842559Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.3842967Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.3843919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.3844685Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.3846613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.3849710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.3854108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.3862328Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.3862688Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.3862950Z E       ^
2025-05-07T20:32:17.3863416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.3864292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
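Every example fails the same way: Triton rejects the fp8e4nv element type while lowering the kernel. fp8e4nv (the e4m3 variant, torch.float8_e4m3fn) requires compute capability 8.9 or newer (Ada/Hopper), while this job's g5.4xlarge runner carries an A10G at sm_86, where only fp8e4b15 and fp8e5 are available, exactly as the error says. A hedged sketch of the kind of device gate that avoids ever compiling an e4m3 kernel on such parts (the helper name is hypothetical, not fbgemm API):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn and needs
        # compute capability >= (8, 9); fp8e5 (torch.float8_e5m2) is the
        # usual fallback on older parts such as the A10G (sm_86) here.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2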
2025-05-07T20:32:17.3864969Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.6697035Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:17.6713439Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.6713716Z moe/activation_test.py:117: 
2025-05-07T20:32:17.6714351Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.6714639Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.6715201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.6716444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.6717135Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.6720191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.6728618Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.6729038Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.6729334Z E       ^
2025-05-07T20:32:17.6729801Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.6730670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
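Both call paths hit the same wall: the op under test (_fbgemm_silu_mul_quant, reached through silu_mul_quant) and the reference (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) each quantize to e4m3. When the hardware cannot change, the usual fix on the test side is an explicit skip; a minimal sketch with unittest, where the predicate and class name are hypothetical:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Matches the error above: fp8e4nv needs an sm_89+ (Ada/Hopper) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the listing above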
2025-05-07T20:32:17.6731287Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.7208843Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:17.7228860Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.7229157Z moe/activation_test.py:126: 
2025-05-07T20:32:17.7229781Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.7230108Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.7230899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.7231653Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:17.7233587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:17.7236538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:17.7241146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.7249580Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.7249948Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.7250214Z E       ^
2025-05-07T20:32:17.7250675Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.7251547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.7252165Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:17.8410561Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:17.8422288Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.8422551Z moe/activation_test.py:117: 
2025-05-07T20:32:17.8423248Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.8423535Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.8424223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.8424905Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.8427861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.8436022Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8436376Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8436633Z E       ^
2025-05-07T20:32:17.8437092Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8437962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
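The compiled=False example above fails identically, which rules out torch.compile: the Triton JIT specializes the kernel for the current GPU at its first launch either way. For reference, the row-wise quantization the test expects can be emulated in plain PyTorch. This is a sketch assuming the dequantization convention visible in the listing (y is recovered as y_fp8.float() * y_scale[:, None]) and the e4m3fn maximum of 448; it is not fbgemm's implementation:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so that y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap extreme rows
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale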
2025-05-07T20:32:17.8438862Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.8447924Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:17.8459518Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.8459783Z moe/activation_test.py:117: 
2025-05-07T20:32:17.8460399Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.8460677Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.8461231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.8462449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.8463136Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.8466182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.8474331Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8474695Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8474953Z E       ^
2025-05-07T20:32:17.8475410Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8476280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
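The _fbgemm_silu_mul_quant[grid]( frame is Triton's launch syntax: subscripting a @triton.jit function with a grid returns a launcher, and compilation happens lazily inside that first call, which is why the error surfaces under jit.py's run and compile rather than at import time. A toy kernel showing the same pattern (everything here is illustrative, not fbgemm code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
        # JIT-compiled for the current GPU on first launch, like the kernels above.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

    def copy(src: torch.Tensor) -> torch.Tensor:
        dst = torch.empty_like(src)
        grid = (triton.cdiv(src.numel(), 1024),)
        _copy_kernel[grid](src, dst, src.numel(), BLOCK=1024)  # compile + launch
        return dst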
2025-05-07T20:32:17.8476897Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.9345571Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:17.9357119Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.9357378Z moe/activation_test.py:117: 
2025-05-07T20:32:17.9357991Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.9358267Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.9358967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.9359655Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9362604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9370893Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9371239Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9371496Z E       ^
2025-05-07T20:32:17.9371958Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9372849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
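The ref_fn failures earlier route through autotuner.py because _kernel_quantize_fp8_row is autotuned: Triton benchmarks every candidate config (the timings = {config: self._bench(...)} frames), so the unsupported-dtype error fires while compiling the first candidate. The decorator pattern looks like this sketch, with configs invented purely for illustration:

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 256}, num_warps=4),
            triton.Config({"BLOCK": 1024}, num_warps=8),
        ],
        key=["n"],  # re-benchmark when this argument changes
    )
    @triton.jit
    def _scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)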
2025-05-07T20:32:17.9373469Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:17.9376126Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:17.9387622Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.9387931Z moe/activation_test.py:117: 
2025-05-07T20:32:17.9388541Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.9388816Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.9389548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.9390238Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9393229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9401307Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9401661Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9401912Z E       ^
2025-05-07T20:32:17.9402369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9403416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
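Hypothesis's verbose mode prints a Trying example block for each sampled combination, and since every combination fails on this GPU the sweep adds little signal after the first hit. When chasing a single case from such a log, pinning the parameters with @example makes the repro deterministic, because explicit examples run before generated ones; a sketch with a trivial body:

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=128, D=5120)  # always exercise the combination seen in this log
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        assert T > 0 and D > 0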
2025-05-07T20:32:17.9404043Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:18.2485812Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:18.2497614Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:18.2497881Z moe/activation_test.py:117: 
2025-05-07T20:32:18.2498496Z moe/activation_test.py:115: in fn
2025-05-07T20:32:18.2498776Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.2499467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:18.2500153Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:18.2503191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.2511353Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.2511753Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.2512011Z E       ^
2025-05-07T20:32:18.2512473Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.2513342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
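The listing's input construction (sign times a clamp of the magnitude to [0.01, 2.0]) keeps every activation safely inside fp8 range, so the quantity being checked is purely silu(x0) * x1, computed in fp32 by the reference. As a sketch of that reference math using torch's built-in SiLU, which is equivalent to the x * sigmoid(x) written out in the listing:

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x) = x * sigmoid(x); upcast to fp32 first, as ref_fn does above.
        return F.silu(x0.float()) * x1.float()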
2025-05-07T20:32:18.2513966Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:18.2516580Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:18.2534344Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:18.2534613Z moe/activation_test.py:117: 
2025-05-07T20:32:18.2535238Z moe/activation_test.py:115: in fn
2025-05-07T20:32:18.2535520Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.2536131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:18.2537353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:18.2538041Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:18.2541252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.2549506Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.2549859Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.2550120Z E       ^
2025-05-07T20:32:18.2550586Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.2551520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
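The contiguous=False cases feed x0 = x[:, :D] straight to the op: a view whose row stride is 2*D, which is exactly the 10240-vs-5120 stride mismatch the Dynamo guard reported further up. A small demonstration of the two layouts:

    import torch

    D = 5120
    x = torch.randn(4, 2 * D)
    x0_view = x[:, :D]              # view into x: stride (10240, 1)
    x0_copy = x0_view.contiguous()  # fresh buffer: stride (5120, 1)
    print(x0_view.stride(), x0_copy.stride())  # (10240, 1) (5120, 1)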
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.2551039Z
2025-05-07T20:32:18.2551520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.2552199Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
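A note on the failure mode, with a minimal sketch (not part of the test file): Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and this error is what Triton raises when the kernel is JIT-compiled on a GPU older than SM 8.9, since native e4m3 support arrived with SM 8.9 (Ada) / SM 9.0 (Hopper); Ampere-class parts such as the A10G are SM 8.6 and only offer fp8e4b15/fp8e5. The helper name supports_fp8e4nv and the skip wiring below are assumptions for illustration, not FBGEMM code.

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) compiles in Triton only on SM 8.9+ GPUs.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip rather than fail on pre-Ada hardware, e.g. with
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# on test_silu_mul_quant.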
2025-05-07T20:32:18.3599970Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3600384Z self=,
2025-05-07T20:32:18.3600778Z T=1,
2025-05-07T20:32:18.3600960Z D=7168,
2025-05-07T20:32:18.3601147Z scale_ub=None,
2025-05-07T20:32:18.3601358Z contiguous=False,
2025-05-07T20:32:18.3601582Z compiled=True,
2025-05-07T20:32:18.3601784Z )
2025-05-07T20:32:18.4280519Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:18.4290674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:18.4290983Z op = silu_mul_quant
2025-05-07T20:32:18.4291224Z if compiled:
2025-05-07T20:32:18.4291465Z op = torch.compile(op)
2025-05-07T20:32:18.4291752Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:18.4292213Z y_fp8, y_scale = fn()
2025-05-07T20:32:18.4292492Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:18.4293016Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:18.4293342Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:18.4293641Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:18.4293952Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:18.4294308Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:18.4294824Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:18.4295120Z moe/activation_test.py:126:
2025-05-07T20:32:18.4295417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:18.4295753Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:18.4296070Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:18.4296861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:18.4297612Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:18.4298208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:18.4298896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:18.4299588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:18.4300321Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:18.4302548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:18.4303232Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:18.4303833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:18.4304391Z fn()
2025-05-07T20:32:18.4304901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:18.4305479Z self.fn.run(
2025-05-07T20:32:18.4305939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:18.4306467Z kernel = self.compile(
2025-05-07T20:32:18.4307008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:18.4307657Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:18.4308049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:18.4309561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:18.4310969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fc9580>}
2025-05-07T20:32:18.4312300Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:18.4315187Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:18.4315711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:18.4316178Z module_map=module_map)
2025-05-07T20:32:18.4316540Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.4316895Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:18.4317154Z E ^
2025-05-07T20:32:18.4317615Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.4318487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.4319107Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
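For context on the second failure site: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose autotuned row-wise quantization kernel hits the same architecture check. Below is a pure-PyTorch stand-in for what that row-wise quantization plausibly computes; the clamp-by-scale_ub ordering and the epsilon are assumptions, not FBGEMM's exact semantics.

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale so that each row's max magnitude maps to the fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

# Dequantization then matches the test: y ~= y_fp8.to(torch.float32) * y_scale[:, None].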
2025-05-07T20:32:18.5559967Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
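The eager (compiled=False) example above shows the Triton kernel is launched directly by silu_mul_quant, so torch.compile is not a factor. A minimal repro sketch, assuming the import path shown in the tracebacks and the (x0, x1, scale_ub) calling convention used by the test:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x = torch.randn([1, 2 * 5120], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :5120].contiguous(), x[:, 5120:].contiguous()
# On a pre-SM 8.9 GPU this raises the same Triton CompilationError at JIT time.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)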
2025-05-07T20:32:18.5596889Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.8070616Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9000936Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9031751Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.9062803Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.0477917Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) ... E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
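Across every example above the outcome is identical for compiled=True and compiled=False, which is consistent with silu_mul_quant launching the Triton kernel directly while torch.compile merely wraps the call. Only the FP8 quantization step requires SM 8.9+ hardware; the activation math itself is ordinary fp32, as in ref_fn. A standalone sketch of that portion:

import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn in moe/activation_test.py;
    # this part runs on any device, it is the quantization that does not.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32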
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.0508537Z 2025-05-07T20:32:19.0508953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.0509478Z 2025-05-07T20:32:19.1657228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1657700Z self=, 2025-05-07T20:32:19.1658105Z T=4096, 2025-05-07T20:32:19.1658289Z D=5120, 2025-05-07T20:32:19.1658480Z scale_ub=1200.0, 2025-05-07T20:32:19.1658704Z contiguous=False, 2025-05-07T20:32:19.1658922Z compiled=False, 2025-05-07T20:32:19.1659126Z ) 2025-05-07T20:32:19.1659446Z self = 2025-05-07T20:32:19.1660195Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1660755Z 2025-05-07T20:32:19.1661114Z @given( 2025-05-07T20:32:19.1661572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1662179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1662771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1663413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1664051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1664604Z ) 2025-05-07T20:32:19.1665287Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1666159Z def test_silu_mul_quant( 2025-05-07T20:32:19.1666619Z self, 2025-05-07T20:32:19.1667000Z T: int, 2025-05-07T20:32:19.1667372Z D: int, 2025-05-07T20:32:19.1667788Z scale_ub: Optional[float], 2025-05-07T20:32:19.1668315Z contiguous: bool, 2025-05-07T20:32:19.1668778Z compiled: bool, 2025-05-07T20:32:19.1669320Z ) -> None: 2025-05-07T20:32:19.1669743Z torch.manual_seed(2025) 2025-05-07T20:32:19.1670137Z 2025-05-07T20:32:19.1670419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1670755Z 2025-05-07T20:32:19.1670944Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1671228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1671532Z x = x_sign * x_clamp 2025-05-07T20:32:19.1671766Z x0 = x[:, :D] 2025-05-07T20:32:19.1671977Z x1 = x[:, D:] 2025-05-07T20:32:19.1672180Z 2025-05-07T20:32:19.1672368Z if contiguous: 2025-05-07T20:32:19.1672597Z x0 = x0.contiguous() 2025-05-07T20:32:19.1672855Z x1 = x1.contiguous() 2025-05-07T20:32:19.1673093Z 2025-05-07T20:32:19.1673278Z if scale_ub is not None: 2025-05-07T20:32:19.1673545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1673873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1674181Z ) 2025-05-07T20:32:19.1674375Z else: 2025-05-07T20:32:19.1674651Z scale_ub_tensor = None 2025-05-07T20:32:19.1674904Z 2025-05-07T20:32:19.1675130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1675441Z op = silu_mul_quant 2025-05-07T20:32:19.1675694Z if compiled: 2025-05-07T20:32:19.1675964Z op = torch.compile(op) 2025-05-07T20:32:19.1676252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1676520Z 2025-05-07T20:32:19.1676710Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1676872Z 2025-05-07T20:32:19.1676969Z moe/activation_test.py:117: 2025-05-07T20:32:19.1677253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1677579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1677857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1678541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.1679233Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1679819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1680492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1681153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1681687Z kernel = self.compile( 2025-05-07T20:32:19.1682236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1682886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1683355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1683577Z 2025-05-07T20:32:19.1683789Z self = 2025-05-07T20:32:19.1684908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1686269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a329f420>} 2025-05-07T20:32:19.1687615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1688639Z context = 2025-05-07T20:32:19.1688923Z 2025-05-07T20:32:19.1689095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1689659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1690160Z module_map=module_map) 2025-05-07T20:32:19.1690521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1690872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1691122Z E ^ 2025-05-07T20:32:19.1691585Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1692037Z 2025-05-07T20:32:19.1692458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1692970Z 2025-05-07T20:32:19.1693074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1693480Z self=, 2025-05-07T20:32:19.1693882Z T=4096, 2025-05-07T20:32:19.1694069Z D=5120, 2025-05-07T20:32:19.1694252Z scale_ub=1200.0, 2025-05-07T20:32:19.1694477Z contiguous=False, 2025-05-07T20:32:19.1694744Z compiled=True, 2025-05-07T20:32:19.1694942Z ) 2025-05-07T20:32:19.1695257Z self = 2025-05-07T20:32:19.1695750Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1696020Z 2025-05-07T20:32:19.1696097Z @given( 2025-05-07T20:32:19.1696326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1696640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1696943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1697263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1697588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1697875Z ) 2025-05-07T20:32:19.1698215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1698654Z def test_silu_mul_quant( 2025-05-07T20:32:19.1698892Z self, 2025-05-07T20:32:19.1699087Z T: int, 2025-05-07T20:32:19.1699284Z D: int, 2025-05-07T20:32:19.1699502Z scale_ub: Optional[float], 2025-05-07T20:32:19.1699763Z contiguous: bool, 2025-05-07T20:32:19.1700001Z compiled: bool, 2025-05-07T20:32:19.1700216Z ) -> None: 2025-05-07T20:32:19.1700420Z torch.manual_seed(2025) 2025-05-07T20:32:19.1700661Z 2025-05-07T20:32:19.1700933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1701268Z 2025-05-07T20:32:19.1701461Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1701749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1702052Z x = x_sign * x_clamp 2025-05-07T20:32:19.1702283Z x0 = x[:, :D] 2025-05-07T20:32:19.1702498Z x1 = x[:, D:] 2025-05-07T20:32:19.1702700Z 2025-05-07T20:32:19.1702881Z if contiguous: 2025-05-07T20:32:19.1703114Z x0 = x0.contiguous() 2025-05-07T20:32:19.1703419Z x1 = x1.contiguous() 2025-05-07T20:32:19.1703661Z 2025-05-07T20:32:19.1703844Z if scale_ub is not None: 2025-05-07T20:32:19.1704115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1704442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1704750Z ) 2025-05-07T20:32:19.1704938Z else: 2025-05-07T20:32:19.1705142Z scale_ub_tensor = None 2025-05-07T20:32:19.1705392Z 2025-05-07T20:32:19.1705621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1705928Z op = silu_mul_quant 2025-05-07T20:32:19.1706173Z if compiled: 2025-05-07T20:32:19.1706412Z op = torch.compile(op) 2025-05-07T20:32:19.1706705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1706971Z 2025-05-07T20:32:19.1707162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1707323Z 2025-05-07T20:32:19.1707473Z moe/activation_test.py:117: 2025-05-07T20:32:19.1707768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1708135Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1708418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1708969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1709529Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1710235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1710923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1711449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1712133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1712798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1713374Z kernel = self.compile( 2025-05-07T20:32:19.1713919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1714579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1714973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1715201Z 2025-05-07T20:32:19.1715406Z self = 2025-05-07T20:32:19.1716478Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1717840Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf0860>} 2025-05-07T20:32:19.1719185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1720211Z context = 2025-05-07T20:32:19.1720494Z 2025-05-07T20:32:19.1720661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1721182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1721643Z module_map=module_map) 2025-05-07T20:32:19.1722000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1722355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1722610Z E ^ 2025-05-07T20:32:19.1723077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1723680Z 2025-05-07T20:32:19.1724102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1724620Z 2025-05-07T20:32:19.2601480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2602350Z self=, 2025-05-07T20:32:19.2603152Z T=2048, 2025-05-07T20:32:19.2605435Z D=7168, 2025-05-07T20:32:19.2605800Z scale_ub=1200.0, 2025-05-07T20:32:19.2606230Z contiguous=False, 2025-05-07T20:32:19.2606663Z compiled=False, 2025-05-07T20:32:19.2607065Z ) 2025-05-07T20:32:19.2607694Z self = 2025-05-07T20:32:19.2608675Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.2609220Z 2025-05-07T20:32:19.2609371Z @given( 2025-05-07T20:32:19.2609968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.2610377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.2610680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.2611005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.2611335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.2611615Z ) 2025-05-07T20:32:19.2611962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.2612401Z def test_silu_mul_quant( 2025-05-07T20:32:19.2612635Z self, 2025-05-07T20:32:19.2612828Z T: int, 2025-05-07T20:32:19.2613023Z D: int, 2025-05-07T20:32:19.2613242Z scale_ub: Optional[float], 2025-05-07T20:32:19.2614957Z contiguous: bool, 2025-05-07T20:32:19.2615189Z compiled: bool, 2025-05-07T20:32:19.2615409Z ) -> None: 2025-05-07T20:32:19.2615611Z torch.manual_seed(2025) 2025-05-07T20:32:19.2615843Z 2025-05-07T20:32:19.2616120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.2616528Z 2025-05-07T20:32:19.2616716Z x_sign = torch.sign(x) 2025-05-07T20:32:19.2617002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.2617301Z x = x_sign * x_clamp 2025-05-07T20:32:19.2617536Z x0 = x[:, :D] 2025-05-07T20:32:19.2617745Z x1 = x[:, D:] 2025-05-07T20:32:19.2617944Z 2025-05-07T20:32:19.2618124Z if contiguous: 2025-05-07T20:32:19.2618354Z x0 = x0.contiguous() 2025-05-07T20:32:19.2618602Z x1 = x1.contiguous() 2025-05-07T20:32:19.2618836Z 2025-05-07T20:32:19.2619031Z if scale_ub is not None: 2025-05-07T20:32:19.2619292Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.2619627Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.2619925Z ) 2025-05-07T20:32:19.2620112Z else: 2025-05-07T20:32:19.2620316Z scale_ub_tensor = None 2025-05-07T20:32:19.2620563Z 2025-05-07T20:32:19.2620800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.2621108Z op = silu_mul_quant 2025-05-07T20:32:19.2621354Z if compiled: 2025-05-07T20:32:19.2621595Z op = torch.compile(op) 2025-05-07T20:32:19.2621883Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2622156Z 2025-05-07T20:32:19.2622350Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.2622510Z 2025-05-07T20:32:19.2622607Z moe/activation_test.py:117: 2025-05-07T20:32:19.2622896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2623229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.2623506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2624190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.2624909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.2625523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.2626211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.2626868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.2627398Z kernel = self.compile( 2025-05-07T20:32:19.2627943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.2635092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2635528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2635756Z 2025-05-07T20:32:19.2635967Z self = 2025-05-07T20:32:19.2637138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.2638722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf16c0>} 2025-05-07T20:32:19.2640122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.2641143Z context = 2025-05-07T20:32:19.2641427Z 2025-05-07T20:32:19.2641595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.2642116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2642592Z module_map=module_map) 2025-05-07T20:32:19.2643045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2643526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2643785Z E ^ 2025-05-07T20:32:19.2644246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.2644697Z 2025-05-07T20:32:19.2645117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.2645642Z 2025-05-07T20:32:19.2645744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2646154Z self=, 2025-05-07T20:32:19.2646542Z T=1, 2025-05-07T20:32:19.2646718Z D=7168, 2025-05-07T20:32:19.2646906Z scale_ub=None, 2025-05-07T20:32:19.2647116Z contiguous=True, 2025-05-07T20:32:19.2647329Z compiled=False, 2025-05-07T20:32:19.2647532Z ) 2025-05-07T20:32:19.2647853Z self = 2025-05-07T20:32:19.2648336Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.2648597Z 2025-05-07T20:32:19.2648673Z @given( 2025-05-07T20:32:19.2648903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.2649210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.2649509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.2649826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.2650141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.2650420Z ) 2025-05-07T20:32:19.2650764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.2651208Z def test_silu_mul_quant( 2025-05-07T20:32:19.2651439Z self, 2025-05-07T20:32:19.2651626Z T: int, 2025-05-07T20:32:19.2651819Z D: int, 2025-05-07T20:32:19.2652026Z scale_ub: Optional[float], 2025-05-07T20:32:19.2652367Z contiguous: bool, 2025-05-07T20:32:19.2652603Z compiled: bool, 2025-05-07T20:32:19.2652817Z ) -> None: 2025-05-07T20:32:19.2653032Z torch.manual_seed(2025) 2025-05-07T20:32:19.2653269Z 2025-05-07T20:32:19.2653532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.2653872Z 2025-05-07T20:32:19.2654059Z x_sign = torch.sign(x) 2025-05-07T20:32:19.2654341Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.2654642Z x = x_sign * x_clamp 2025-05-07T20:32:19.2654877Z x0 = x[:, :D] 2025-05-07T20:32:19.2655082Z x1 = x[:, D:] 2025-05-07T20:32:19.2655283Z 2025-05-07T20:32:19.2655468Z if contiguous: 2025-05-07T20:32:19.2655701Z x0 = x0.contiguous() 2025-05-07T20:32:19.2655948Z x1 = x1.contiguous() 2025-05-07T20:32:19.2656181Z 2025-05-07T20:32:19.2656448Z if scale_ub is not None: 2025-05-07T20:32:19.2656766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.2657096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.2657400Z ) 2025-05-07T20:32:19.2657583Z else: 2025-05-07T20:32:19.2657789Z scale_ub_tensor = None 2025-05-07T20:32:19.2658036Z 2025-05-07T20:32:19.2658257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.2658562Z op = silu_mul_quant 2025-05-07T20:32:19.2658809Z if compiled: 2025-05-07T20:32:19.2659045Z op = torch.compile(op) 2025-05-07T20:32:19.2659341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2659617Z 2025-05-07T20:32:19.2659805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.2659976Z 2025-05-07T20:32:19.2660093Z moe/activation_test.py:117: 2025-05-07T20:32:19.2660408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2660735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.2661067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.2661753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.2662440Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.2662974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.2663655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.2664311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.2664839Z kernel = self.compile( 2025-05-07T20:32:19.2665378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.2666032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2666423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.2666655Z 2025-05-07T20:32:19.2666859Z self = 2025-05-07T20:32:19.2667934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.2669291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf0fe0>} 2025-05-07T20:32:19.2670626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.2671647Z context = 2025-05-07T20:32:19.2671982Z 2025-05-07T20:32:19.2672148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.2672663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2673118Z module_map=module_map) 2025-05-07T20:32:19.2673476Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2673828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2674077Z E ^ 2025-05-07T20:32:19.2674533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.2674988Z 2025-05-07T20:32:19.2675405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.2675919Z 2025-05-07T20:32:19.2676024Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.2676472Z self=, 2025-05-07T20:32:19.2676908Z T=16384, 2025-05-07T20:32:19.2677090Z D=7168, 2025-05-07T20:32:19.2677272Z scale_ub=1200.0, 2025-05-07T20:32:19.2677490Z contiguous=False, 2025-05-07T20:32:19.2677710Z compiled=True, 2025-05-07T20:32:19.6324326Z ) 2025-05-07T20:32:19.6325136Z self = 2025-05-07T20:32:19.6325693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.6325982Z 2025-05-07T20:32:19.6326066Z @given( 2025-05-07T20:32:19.6326318Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6326639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6326954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6327287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6327645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6327941Z ) 2025-05-07T20:32:19.6328311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6329099Z def test_silu_mul_quant( 2025-05-07T20:32:19.6329349Z self, 2025-05-07T20:32:19.6329544Z T: int, 2025-05-07T20:32:19.6329752Z D: int, 2025-05-07T20:32:19.6329982Z scale_ub: Optional[float], 2025-05-07T20:32:19.6330253Z contiguous: bool, 2025-05-07T20:32:19.6330500Z compiled: bool, 2025-05-07T20:32:19.6330739Z ) -> None: 2025-05-07T20:32:19.6330955Z torch.manual_seed(2025) 2025-05-07T20:32:19.6331203Z 2025-05-07T20:32:19.6331487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6331840Z 2025-05-07T20:32:19.6332033Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6332374Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6332804Z x = x_sign * x_clamp 2025-05-07T20:32:19.6333049Z x0 = x[:, :D] 2025-05-07T20:32:19.6333280Z x1 = x[:, D:] 2025-05-07T20:32:19.6333497Z 2025-05-07T20:32:19.6333686Z if contiguous: 2025-05-07T20:32:19.6333936Z x0 = x0.contiguous() 2025-05-07T20:32:19.6334204Z x1 = x1.contiguous() 2025-05-07T20:32:19.6334446Z 2025-05-07T20:32:19.6334652Z if scale_ub is not None: 2025-05-07T20:32:19.6334930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6335268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6335589Z ) 2025-05-07T20:32:19.6335787Z else: 2025-05-07T20:32:19.6335997Z scale_ub_tensor = None 2025-05-07T20:32:19.6336259Z 2025-05-07T20:32:19.6336506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6336820Z op = silu_mul_quant 2025-05-07T20:32:19.6337080Z if compiled: 2025-05-07T20:32:19.6337341Z op = torch.compile(op) 2025-05-07T20:32:19.6337650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6338037Z 2025-05-07T20:32:19.6338241Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6338755Z 2025-05-07T20:32:19.6338879Z moe/activation_test.py:117: 2025-05-07T20:32:19.6339178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6339524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6339857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6340472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.6341043Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.6341724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6342434Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6343099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6344130Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6344840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6345392Z kernel = self.compile( 2025-05-07T20:32:19.6345954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6346627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6347037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6347270Z 2025-05-07T20:32:19.6347491Z self = 2025-05-07T20:32:19.6348582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6350182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1cf3b00>} 2025-05-07T20:32:19.6351549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6352624Z context = 2025-05-07T20:32:19.6352916Z 2025-05-07T20:32:19.6353089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6353628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6354112Z module_map=module_map) 2025-05-07T20:32:19.6354490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6354953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6355311Z E ^ 2025-05-07T20:32:19.6355792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6356249Z 2025-05-07T20:32:19.6356676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6357211Z 2025-05-07T20:32:19.6357320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6357743Z self=, 2025-05-07T20:32:19.6358157Z T=1, 2025-05-07T20:32:19.6358343Z D=7168, 2025-05-07T20:32:19.6358548Z scale_ub=None, 2025-05-07T20:32:19.6358776Z contiguous=False, 2025-05-07T20:32:19.6358997Z compiled=False, 2025-05-07T20:32:19.6359219Z ) 2025-05-07T20:32:19.6359549Z self = 2025-05-07T20:32:19.6360123Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.6360408Z 2025-05-07T20:32:19.6360486Z @given( 2025-05-07T20:32:19.6360724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6361043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6361348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6361682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6362016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6362299Z ) 2025-05-07T20:32:19.6362654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6363106Z def test_silu_mul_quant( 2025-05-07T20:32:19.6363515Z self, 2025-05-07T20:32:19.6363715Z T: int, 2025-05-07T20:32:19.6363914Z D: int, 2025-05-07T20:32:19.6364126Z scale_ub: Optional[float], 2025-05-07T20:32:19.6364406Z contiguous: bool, 2025-05-07T20:32:19.6364701Z compiled: bool, 2025-05-07T20:32:19.6364965Z ) -> None: 2025-05-07T20:32:19.6365195Z torch.manual_seed(2025) 2025-05-07T20:32:19.6365439Z 2025-05-07T20:32:19.6365723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6366069Z 2025-05-07T20:32:19.6366267Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6366564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6366876Z x = x_sign * x_clamp 2025-05-07T20:32:19.6367130Z x0 = x[:, :D] 2025-05-07T20:32:19.6367359Z x1 = x[:, D:] 2025-05-07T20:32:19.6367567Z 2025-05-07T20:32:19.6367758Z if contiguous: 2025-05-07T20:32:19.6368001Z x0 = x0.contiguous() 2025-05-07T20:32:19.6368259Z x1 = x1.contiguous() 2025-05-07T20:32:19.6368505Z 2025-05-07T20:32:19.6368707Z if scale_ub is not None: 2025-05-07T20:32:19.6368979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6369325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6369720Z ) 2025-05-07T20:32:19.6369937Z else: 2025-05-07T20:32:19.6370153Z scale_ub_tensor = None 2025-05-07T20:32:19.6370411Z 2025-05-07T20:32:19.6370650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6370966Z op = silu_mul_quant 2025-05-07T20:32:19.6371221Z if compiled: 2025-05-07T20:32:19.6371474Z op = torch.compile(op) 2025-05-07T20:32:19.6371772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6372060Z 2025-05-07T20:32:19.6372259Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6372423Z 2025-05-07T20:32:19.6372521Z moe/activation_test.py:117: 2025-05-07T20:32:19.6372819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6373157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6373437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6374137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6374850Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6375395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6376088Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6376765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6377308Z kernel = self.compile( 2025-05-07T20:32:19.6377867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6378530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6378935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6379167Z 2025-05-07T20:32:19.6379437Z self = 2025-05-07T20:32:19.6380573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6381951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268c9a0>} 2025-05-07T20:32:19.6383306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6384349Z context = 2025-05-07T20:32:19.6384639Z 2025-05-07T20:32:19.6384856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6385426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6385903Z module_map=module_map) 2025-05-07T20:32:19.6386275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6386630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6386899Z E ^ 2025-05-07T20:32:19.6387370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6387828Z 2025-05-07T20:32:19.6388258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6388778Z 2025-05-07T20:32:19.6388880Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6389301Z self=, 2025-05-07T20:32:19.6389716Z T=2048, 2025-05-07T20:32:19.6389913Z D=7168, 2025-05-07T20:32:19.6390111Z scale_ub=None, 2025-05-07T20:32:19.6390380Z contiguous=False, 2025-05-07T20:32:19.6390619Z compiled=True, 2025-05-07T20:32:19.6390818Z ) 2025-05-07T20:32:19.7082781Z self = 2025-05-07T20:32:19.7083513Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.7083798Z 2025-05-07T20:32:19.7083880Z @given( 2025-05-07T20:32:19.7084121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.7084433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.7084744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.7085081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.7085406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.7085705Z ) 2025-05-07T20:32:19.7086081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.7086546Z def test_silu_mul_quant( 2025-05-07T20:32:19.7086797Z self, 2025-05-07T20:32:19.7087001Z T: int, 2025-05-07T20:32:19.7087209Z D: int, 2025-05-07T20:32:19.7087424Z scale_ub: Optional[float], 2025-05-07T20:32:19.7087701Z contiguous: bool, 2025-05-07T20:32:19.7087944Z compiled: bool, 2025-05-07T20:32:19.7088166Z ) -> None: 2025-05-07T20:32:19.7088389Z torch.manual_seed(2025) 2025-05-07T20:32:19.7088635Z 2025-05-07T20:32:19.7088910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.7089256Z 2025-05-07T20:32:19.7089459Z x_sign = torch.sign(x) 2025-05-07T20:32:19.7089747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.7090065Z x = x_sign * x_clamp 2025-05-07T20:32:19.7090310Z x0 = x[:, :D] 2025-05-07T20:32:19.7090527Z x1 = x[:, D:] 2025-05-07T20:32:19.7090740Z 2025-05-07T20:32:19.7090945Z if contiguous: 2025-05-07T20:32:19.7091462Z x0 = x0.contiguous() 2025-05-07T20:32:19.7091747Z x1 = x1.contiguous() 2025-05-07T20:32:19.7092007Z 2025-05-07T20:32:19.7092200Z if scale_ub is not None: 2025-05-07T20:32:19.7092485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.7092833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.7093151Z ) 2025-05-07T20:32:19.7093345Z else: 2025-05-07T20:32:19.7093566Z scale_ub_tensor = None 2025-05-07T20:32:19.7093828Z 2025-05-07T20:32:19.7094062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.7094385Z op = silu_mul_quant 2025-05-07T20:32:19.7094644Z if compiled: 2025-05-07T20:32:19.7094891Z op = torch.compile(op) 2025-05-07T20:32:19.7095193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7095479Z 2025-05-07T20:32:19.7095795Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.7095972Z 2025-05-07T20:32:19.7096158Z moe/activation_test.py:117: 2025-05-07T20:32:19.7096467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7096814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.7097098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7097666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.7098248Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.7098909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.7099610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.7100159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.7100856Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.7101526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.7102160Z kernel = self.compile( 2025-05-07T20:32:19.7102712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.7103374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.7103776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7104015Z 2025-05-07T20:32:19.7104223Z self = 2025-05-07T20:32:19.7105303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.7106677Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268dd00>} 2025-05-07T20:32:19.7108026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.7109054Z context = 2025-05-07T20:32:19.7109340Z 2025-05-07T20:32:19.7109515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.7110045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.7110513Z module_map=module_map) 2025-05-07T20:32:19.7110884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.7111252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.7111513Z E ^ 2025-05-07T20:32:19.7112031Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.7112491Z 2025-05-07T20:32:19.7112921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.7113443Z 2025-05-07T20:32:19.7113556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.7113971Z self=, 2025-05-07T20:32:19.7114380Z T=4096, 2025-05-07T20:32:19.7114579Z D=7168, 2025-05-07T20:32:19.7114772Z scale_ub=None, 2025-05-07T20:32:19.7114998Z contiguous=False, 2025-05-07T20:32:19.7115232Z compiled=True, 2025-05-07T20:32:19.7115439Z ) 2025-05-07T20:32:19.7115763Z self = 2025-05-07T20:32:19.7116264Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.7116539Z 2025-05-07T20:32:19.7116672Z @given( 2025-05-07T20:32:19.7116941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.7117264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.7117577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.7117906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.7118241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.7118550Z ) 2025-05-07T20:32:19.7118896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.7119348Z def test_silu_mul_quant( 2025-05-07T20:32:19.7119600Z self, 2025-05-07T20:32:19.7119799Z T: int, 2025-05-07T20:32:19.7120006Z D: int, 2025-05-07T20:32:19.7120226Z scale_ub: Optional[float], 2025-05-07T20:32:19.7120494Z contiguous: bool, 2025-05-07T20:32:19.7129828Z compiled: bool, 2025-05-07T20:32:19.7130122Z ) -> None: 2025-05-07T20:32:19.7130351Z torch.manual_seed(2025) 2025-05-07T20:32:19.7130610Z 2025-05-07T20:32:19.7130906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.7131329Z 2025-05-07T20:32:19.7131540Z x_sign = torch.sign(x) 2025-05-07T20:32:19.7131849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.7132160Z x = x_sign * x_clamp 2025-05-07T20:32:19.7132412Z x0 = x[:, :D] 2025-05-07T20:32:19.7132641Z x1 = x[:, D:] 2025-05-07T20:32:19.7132847Z 2025-05-07T20:32:19.7133049Z if contiguous: 2025-05-07T20:32:19.7133294Z x0 = x0.contiguous() 2025-05-07T20:32:19.7133555Z x1 = x1.contiguous() 2025-05-07T20:32:19.7133808Z 2025-05-07T20:32:19.7134012Z if scale_ub is not None: 2025-05-07T20:32:19.7134287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.7134636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.7134957Z ) 2025-05-07T20:32:19.7135168Z else: 2025-05-07T20:32:19.7135384Z scale_ub_tensor = None 2025-05-07T20:32:19.7135656Z 2025-05-07T20:32:19.7135903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.7136222Z op = silu_mul_quant 2025-05-07T20:32:19.7136487Z if compiled: 2025-05-07T20:32:19.7136742Z op = torch.compile(op) 2025-05-07T20:32:19.7137037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7137321Z 2025-05-07T20:32:19.7137523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.7137687Z 2025-05-07T20:32:19.7137789Z moe/activation_test.py:117: 2025-05-07T20:32:19.7138095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7138784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.7139100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.7139692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.7140294Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.7141062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.7141907Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.7142557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.7143393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.7144207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.7144852Z kernel = self.compile( 2025-05-07T20:32:19.7145511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.7146319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.7146855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.7147191Z 2025-05-07T20:32:19.7147432Z self = 2025-05-07T20:32:19.7148783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.7150519Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a268e840>} 2025-05-07T20:32:19.7152209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.7153480Z context = 2025-05-07T20:32:19.7153833Z 2025-05-07T20:32:19.7154028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.7154721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.7155285Z module_map=module_map) 2025-05-07T20:32:19.7155695Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.7156108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.7156404Z E ^ 2025-05-07T20:32:19.7156952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.7157519Z 2025-05-07T20:32:19.7158031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.7158678Z 2025-05-07T20:32:19.8419436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.8419953Z self=, 2025-05-07T20:32:19.8420468Z T=16384, 2025-05-07T20:32:19.8420748Z D=5120, 2025-05-07T20:32:19.8421030Z scale_ub=1200.0, 2025-05-07T20:32:19.8421349Z contiguous=False, 2025-05-07T20:32:19.8421640Z compiled=False, 2025-05-07T20:32:19.8421864Z ) 2025-05-07T20:32:19.8422199Z self = 2025-05-07T20:32:19.8422722Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.8423009Z 2025-05-07T20:32:19.8423095Z @given( 2025-05-07T20:32:19.8423342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.8423673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.8423988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.8424328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.8424657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.8424954Z ) 2025-05-07T20:32:19.8425617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.8426069Z def test_silu_mul_quant( 2025-05-07T20:32:19.8426319Z self, 2025-05-07T20:32:19.8426523Z T: int, 2025-05-07T20:32:19.8426721Z D: int, 2025-05-07T20:32:19.8426947Z scale_ub: Optional[float], 2025-05-07T20:32:19.8427228Z contiguous: bool, 2025-05-07T20:32:19.8427467Z compiled: bool, 2025-05-07T20:32:19.8427705Z ) -> None: 2025-05-07T20:32:19.8427929Z torch.manual_seed(2025) 2025-05-07T20:32:19.8428170Z 2025-05-07T20:32:19.8428458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.8428810Z 2025-05-07T20:32:19.8429005Z x_sign = torch.sign(x) 2025-05-07T20:32:19.8429307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.8429631Z x = x_sign * x_clamp 2025-05-07T20:32:19.8429905Z x0 = x[:, :D] 2025-05-07T20:32:19.8430234Z x1 = x[:, D:] 2025-05-07T20:32:19.8430459Z 2025-05-07T20:32:19.8430749Z if contiguous: 2025-05-07T20:32:19.8430986Z x0 = x0.contiguous() 2025-05-07T20:32:19.8431252Z x1 = x1.contiguous() 2025-05-07T20:32:19.8431504Z 2025-05-07T20:32:19.8431700Z if scale_ub is not None: 2025-05-07T20:32:19.8431982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.8432326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.8432635Z ) 2025-05-07T20:32:19.8432841Z else: 2025-05-07T20:32:19.8433072Z scale_ub_tensor = None 2025-05-07T20:32:19.8433329Z 2025-05-07T20:32:19.8433577Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.8433906Z op = silu_mul_quant 2025-05-07T20:32:19.8434163Z if compiled: 2025-05-07T20:32:19.8434424Z op = torch.compile(op) 2025-05-07T20:32:19.8434731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8435020Z 2025-05-07T20:32:19.8435219Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.8435487Z 2025-05-07T20:32:19.8435593Z moe/activation_test.py:117: 2025-05-07T20:32:19.8435899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8436232Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.8436530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8437233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.8437922Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.8438672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.8439366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.8440043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.8440575Z kernel = self.compile( 2025-05-07T20:32:19.8441132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.8441797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.8442194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8442423Z 2025-05-07T20:32:19.8442632Z self = 2025-05-07T20:32:19.8443812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.8445202Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2784040>} 2025-05-07T20:32:19.8446623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.8447655Z context = 2025-05-07T20:32:19.8447947Z 2025-05-07T20:32:19.8448115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.8448644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.8449117Z module_map=module_map) 2025-05-07T20:32:19.8449486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.8449894Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.8450166Z E ^ 2025-05-07T20:32:19.8450692Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.8451155Z 2025-05-07T20:32:19.8452270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.8452795Z 2025-05-07T20:32:19.8452901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.8453317Z self=, 2025-05-07T20:32:19.8453720Z T=16384, 2025-05-07T20:32:19.8453915Z D=5120, 2025-05-07T20:32:19.8454120Z scale_ub=1200.0, 2025-05-07T20:32:19.8454339Z contiguous=True, 2025-05-07T20:32:19.8454565Z compiled=True, 2025-05-07T20:32:19.8454786Z ) 2025-05-07T20:32:19.8455103Z self = 2025-05-07T20:32:19.8455607Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.8455892Z 2025-05-07T20:32:19.8455974Z @given( 2025-05-07T20:32:19.8456209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.8456519Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.8456907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.8457235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.8457559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.8457851Z ) 2025-05-07T20:32:19.8458206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.8458644Z def test_silu_mul_quant( 2025-05-07T20:32:19.8458889Z self, 2025-05-07T20:32:19.8459089Z T: int, 2025-05-07T20:32:19.8459283Z D: int, 2025-05-07T20:32:19.8459511Z scale_ub: Optional[float], 2025-05-07T20:32:19.8459786Z contiguous: bool, 2025-05-07T20:32:19.8460030Z compiled: bool, 2025-05-07T20:32:19.8460252Z ) -> None: 2025-05-07T20:32:19.8460475Z torch.manual_seed(2025) 2025-05-07T20:32:19.8460719Z 2025-05-07T20:32:19.8460995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.8461339Z 2025-05-07T20:32:19.8461543Z x_sign = torch.sign(x) 2025-05-07T20:32:19.8461851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.8462172Z x = x_sign * x_clamp 2025-05-07T20:32:19.8462414Z x0 = x[:, :D] 2025-05-07T20:32:19.8462639Z x1 = x[:, D:] 2025-05-07T20:32:19.8462843Z 2025-05-07T20:32:19.8463034Z if contiguous: 2025-05-07T20:32:19.8463270Z x0 = x0.contiguous() 2025-05-07T20:32:19.8463524Z x1 = x1.contiguous() 2025-05-07T20:32:19.8463764Z 2025-05-07T20:32:19.8463963Z if scale_ub is not None: 2025-05-07T20:32:19.8464230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.8464564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.8464888Z ) 2025-05-07T20:32:19.8465089Z else: 2025-05-07T20:32:19.8465305Z scale_ub_tensor = None 2025-05-07T20:32:19.8465565Z 2025-05-07T20:32:19.8465841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.8466171Z op = silu_mul_quant 2025-05-07T20:32:19.8466424Z if compiled: 2025-05-07T20:32:19.8466668Z op = torch.compile(op) 2025-05-07T20:32:19.8466968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8467246Z 2025-05-07T20:32:19.8467437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.8467609Z 2025-05-07T20:32:19.8467710Z moe/activation_test.py:117: 2025-05-07T20:32:19.8468007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8468335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.8468624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.8469186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.8469751Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.8470488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.8471225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.8471766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.8472444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.8473110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.8473647Z kernel = self.compile( 2025-05-07T20:32:19.8474191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.8474847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.8475248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.8475474Z 2025-05-07T20:32:19.8475692Z self = 2025-05-07T20:32:19.8476813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.8478173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785300>} 2025-05-07T20:32:19.8479518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.8480546Z context = 2025-05-07T20:32:19.8480834Z 2025-05-07T20:32:19.8481010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.8481531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.8482007Z module_map=module_map) 2025-05-07T20:32:19.8482374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.8482733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.8482987Z E ^ 2025-05-07T20:32:19.8483628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.8484081Z 2025-05-07T20:32:19.8484505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.8485017Z 2025-05-07T20:32:20.1631156Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.1631692Z self=, 2025-05-07T20:32:20.1632135Z T=16384, 2025-05-07T20:32:20.1632364Z D=5120, 2025-05-07T20:32:20.1632567Z scale_ub=None, 2025-05-07T20:32:20.1633140Z contiguous=False, 2025-05-07T20:32:20.1633380Z compiled=True, 2025-05-07T20:32:20.1633594Z ) 2025-05-07T20:32:20.1633918Z self = 2025-05-07T20:32:20.1634411Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:20.1634696Z 2025-05-07T20:32:20.1634776Z @given( 2025-05-07T20:32:20.1635009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.1635325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.1635625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.1635953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.1636281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.1636559Z ) 2025-05-07T20:32:20.1636909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.1637446Z def test_silu_mul_quant( 2025-05-07T20:32:20.1637779Z self, 2025-05-07T20:32:20.1637980Z T: int, 2025-05-07T20:32:20.1638180Z D: int, 2025-05-07T20:32:20.1638677Z scale_ub: Optional[float], 2025-05-07T20:32:20.1638983Z contiguous: bool, 2025-05-07T20:32:20.1639245Z compiled: bool, 2025-05-07T20:32:20.1639486Z ) -> None: 2025-05-07T20:32:20.1639719Z torch.manual_seed(2025) 2025-05-07T20:32:20.1640027Z 2025-05-07T20:32:20.1640329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.1640725Z 2025-05-07T20:32:20.1640933Z x_sign = torch.sign(x) 2025-05-07T20:32:20.1641254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.1641596Z x = x_sign * x_clamp 2025-05-07T20:32:20.1641856Z x0 = x[:, :D] 2025-05-07T20:32:20.1642091Z x1 = x[:, D:] 2025-05-07T20:32:20.1642312Z 2025-05-07T20:32:20.1642511Z if contiguous: 2025-05-07T20:32:20.1642769Z x0 = x0.contiguous() 2025-05-07T20:32:20.1643057Z x1 = x1.contiguous() 2025-05-07T20:32:20.1643565Z 2025-05-07T20:32:20.1643762Z if scale_ub is not None: 2025-05-07T20:32:20.1644031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.1644374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.1644695Z ) 2025-05-07T20:32:20.1644888Z else: 2025-05-07T20:32:20.1645101Z scale_ub_tensor = None 2025-05-07T20:32:20.1645365Z 2025-05-07T20:32:20.1645602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.1645925Z op = silu_mul_quant 2025-05-07T20:32:20.1646178Z if compiled: 2025-05-07T20:32:20.1646430Z op = torch.compile(op) 2025-05-07T20:32:20.1646721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.1647001Z 2025-05-07T20:32:20.1647203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:20.1647367Z 2025-05-07T20:32:20.1647471Z moe/activation_test.py:117: 2025-05-07T20:32:20.1647778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.1648118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:20.1648396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.1648961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:20.1649529Z return fn(*args, **kwargs) 
2025-05-07T20:32:20.1650200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:20.1650882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:20.1651422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.1652106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.1652881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.1653432Z kernel = self.compile( 2025-05-07T20:32:20.1653979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.1654645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.1655039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.1655272Z 2025-05-07T20:32:20.1655480Z self = 2025-05-07T20:32:20.1656561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.1658021Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785e40>} 2025-05-07T20:32:20.1659415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.1660490Z context = 2025-05-07T20:32:20.1660783Z 2025-05-07T20:32:20.1660950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.1661474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.1661939Z module_map=module_map) 2025-05-07T20:32:20.1662309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.1662670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.1662932Z E ^ 2025-05-07T20:32:20.1663401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.1663914Z 
2025-05-07T20:32:20.1664333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.1664850Z 
2025-05-07T20:32:20.1664966Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.1665379Z self=,
2025-05-07T20:32:20.1665778Z T=2048,
2025-05-07T20:32:20.1665969Z D=5120,
2025-05-07T20:32:20.1666167Z scale_ub=None,
2025-05-07T20:32:20.1666383Z contiguous=False,
2025-05-07T20:32:20.1666616Z compiled=True,
2025-05-07T20:32:20.1666826Z )
2025-05-07T20:32:20.2432295Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.2433174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.2433798Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.2434205Z self=,
2025-05-07T20:32:20.2434658Z T=2048,
2025-05-07T20:32:20.2434851Z D=5120,
2025-05-07T20:32:20.2435106Z scale_ub=1200.0,
2025-05-07T20:32:20.2435329Z contiguous=False,
2025-05-07T20:32:20.2435569Z compiled=True,
2025-05-07T20:32:20.2435786Z )
2025-05-07T20:32:20.2465144Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.2466025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.3799163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.3800055Z self=,
2025-05-07T20:32:20.3800495Z T=4096,
2025-05-07T20:32:20.3800685Z D=5120,
2025-05-07T20:32:20.3800885Z scale_ub=1200.0,
2025-05-07T20:32:20.3801116Z contiguous=True,
2025-05-07T20:32:20.3801335Z compiled=True,
2025-05-07T20:32:20.3801549Z )
2025-05-07T20:32:20.3831494Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.3832378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.3833014Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.3833432Z self=,
2025-05-07T20:32:20.3833843Z T=128,
2025-05-07T20:32:20.3834090Z D=5120,
2025-05-07T20:32:20.3834279Z scale_ub=1200.0,
2025-05-07T20:32:20.3834505Z contiguous=False,
2025-05-07T20:32:20.3834731Z compiled=True,
2025-05-07T20:32:20.3834926Z )
2025-05-07T20:32:20.6500876Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.6501753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.6502379Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.6502795Z self=,
2025-05-07T20:32:20.6503203Z T=16384,
2025-05-07T20:32:20.6503399Z D=7168,
2025-05-07T20:32:20.6503598Z scale_ub=1200.0,
2025-05-07T20:32:20.6503832Z contiguous=True,
2025-05-07T20:32:20.6504058Z compiled=True,
2025-05-07T20:32:20.6504270Z )
2025-05-07T20:32:20.6533497Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.6534378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.7494894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7496230Z self=,
2025-05-07T20:32:20.7497108Z T=16384,
2025-05-07T20:32:20.7497502Z D=5120,
2025-05-07T20:32:20.7497890Z scale_ub=1200.0,
2025-05-07T20:32:20.7498330Z contiguous=True,
2025-05-07T20:32:20.7498786Z compiled=False,
2025-05-07T20:32:20.7499198Z )
2025-05-07T20:32:20.7536476Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7537397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.7538031Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7538761Z self=,
2025-05-07T20:32:20.7539178Z T=1,
2025-05-07T20:32:20.7539368Z D=7168,
2025-05-07T20:32:20.7539568Z scale_ub=1200.0,
2025-05-07T20:32:20.7539789Z contiguous=False,
2025-05-07T20:32:20.7540023Z compiled=False,
2025-05-07T20:32:20.7540230Z )
2025-05-07T20:32:20.7567979Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7568863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.8909074Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.8909591Z self=,
2025-05-07T20:32:20.8910014Z T=4096,
2025-05-07T20:32:20.8910215Z D=7168,
2025-05-07T20:32:20.8910418Z scale_ub=1200.0,
2025-05-07T20:32:20.8910650Z contiguous=False,
2025-05-07T20:32:20.8910909Z compiled=True,
2025-05-07T20:32:20.8911130Z )
2025-05-07T20:32:20.8941085Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.8941972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.8942673Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.8943097Z self=,
2025-05-07T20:32:20.8943506Z T=128,
2025-05-07T20:32:20.8943701Z D=7168,
2025-05-07T20:32:20.8943896Z scale_ub=1200.0,
2025-05-07T20:32:20.8944134Z contiguous=False,
2025-05-07T20:32:20.8944363Z compiled=True,
2025-05-07T20:32:20.8944568Z )
2025-05-07T20:32:20.9702726Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.9703598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:20.9704226Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.9704647Z self=,
2025-05-07T20:32:20.9705060Z T=2048,
2025-05-07T20:32:20.9705249Z D=7168,
2025-05-07T20:32:20.9705496Z scale_ub=None,
2025-05-07T20:32:20.9705720Z contiguous=True,
2025-05-07T20:32:20.9705986Z compiled=True,
2025-05-07T20:32:20.9706200Z )
2025-05-07T20:32:20.9743221Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.9744108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
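Every example above aborts at the same point: the _fbgemm_silu_mul_quant kernel requests Triton's fp8e4nv dtype (the e4m3 variant backing torch.float8_e4m3fn), which Triton only lowers on CUDA devices of compute capability 8.9 or newer; the GPU serving this job evidently predates that, hence the identical CompilationError for all ten parameter combinations. A minimal capability guard along the following lines would skip instead of fail on such devices; the helper and its placement are a hypothetical sketch, not code from the FBGEMM test suite.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering requires SM 8.9+ (Ada/Hopper); on older
    # parts only fp8e4b15 and fp8e5 are offered, as the error reports.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):
    ...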
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.9743674Z 2025-05-07T20:32:20.9744108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9744639Z 2025-05-07T20:32:21.0404613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0405948Z self=, 2025-05-07T20:32:21.0406752Z T=16384, 2025-05-07T20:32:21.0407149Z D=5120, 2025-05-07T20:32:21.0407537Z scale_ub=None, 2025-05-07T20:32:21.0407958Z contiguous=False, 2025-05-07T20:32:21.0408413Z compiled=False, 2025-05-07T20:32:21.0408817Z ) 2025-05-07T20:32:21.0409447Z self = 2025-05-07T20:32:21.0410240Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.0410524Z 2025-05-07T20:32:21.0410613Z @given( 2025-05-07T20:32:21.0410862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0411392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0411724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0412061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0412386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0412674Z ) 2025-05-07T20:32:21.0413023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0413461Z def test_silu_mul_quant( 2025-05-07T20:32:21.0413713Z self, 2025-05-07T20:32:21.0413916Z T: int, 2025-05-07T20:32:21.0414116Z D: int, 2025-05-07T20:32:21.0414343Z scale_ub: Optional[float], 2025-05-07T20:32:21.0414623Z contiguous: bool, 2025-05-07T20:32:21.0414865Z compiled: bool, 2025-05-07T20:32:21.0415099Z ) -> None: 2025-05-07T20:32:21.0415322Z torch.manual_seed(2025) 2025-05-07T20:32:21.0415558Z 2025-05-07T20:32:21.0415924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0416347Z 2025-05-07T20:32:21.0416555Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0416846Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0418874Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
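The CompilationError above is a hardware limitation rather than a bug in the kernel source: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton accepts on NVIDIA parts only from compute capability sm_89 (Ada) onward, and a GPU whose supported list is ('fp8e4b15', 'fp8e5'), as reported here, is a pre-sm_89 device. A minimal sketch of a capability guard that would skip these examples up front; the helper name and the decorator placement are illustrative, not part of the test file:

```python
import torch

def device_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv is torch.float8_e4m3fn, accepted by Triton on NVIDIA
    # only from sm_89 (Ada) onward; older parts expose just fp8e4b15/fp8e5,
    # which matches the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative use on the test above:
#   @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv needs sm_89+")
#   def test_silu_mul_quant(self, ...) -> None: ...
```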
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0420762Z 2025-05-07T20:32:21.0420885Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0421112Z 2025-05-07T20:32:21.0421222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0421647Z self=, 2025-05-07T20:32:21.0422130Z T=4096, 2025-05-07T20:32:21.0422330Z D=7168, 2025-05-07T20:32:21.0422530Z scale_ub=1200.0, 2025-05-07T20:32:21.0422757Z contiguous=True, 2025-05-07T20:32:21.0422992Z compiled=True, 2025-05-07T20:32:21.0423202Z ) 2025-05-07T20:32:21.0423526Z self = 2025-05-07T20:32:21.0424018Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.0424303Z 2025-05-07T20:32:21.0424382Z @given( 2025-05-07T20:32:21.0424616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0424929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0425244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0425579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0425906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0426207Z ) 2025-05-07T20:32:21.0426565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0427006Z def test_silu_mul_quant( 2025-05-07T20:32:21.0427252Z self, 2025-05-07T20:32:21.0427454Z T: int, 2025-05-07T20:32:21.0427649Z D: int, 2025-05-07T20:32:21.0427872Z scale_ub: Optional[float], 2025-05-07T20:32:21.0428150Z contiguous: bool, 2025-05-07T20:32:21.0428389Z compiled: bool, 2025-05-07T20:32:21.0428616Z ) -> None: 2025-05-07T20:32:21.0428834Z torch.manual_seed(2025) 2025-05-07T20:32:21.0429078Z 2025-05-07T20:32:21.0429351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0429697Z 2025-05-07T20:32:21.0429891Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0430185Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0432228Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0434097Z 2025-05-07T20:32:21.0434217Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0434430Z 2025-05-07T20:32:21.0434537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0434941Z self=, 2025-05-07T20:32:21.0435347Z T=16384, 2025-05-07T20:32:21.0435542Z D=7168, 2025-05-07T20:32:21.0435730Z scale_ub=None, 2025-05-07T20:32:21.0435991Z contiguous=False, 2025-05-07T20:32:21.0436226Z compiled=False, 2025-05-07T20:32:21.0436473Z ) 2025-05-07T20:32:21.0436792Z self = 2025-05-07T20:32:21.0437291Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.0437569Z 2025-05-07T20:32:21.0437654Z @given( 2025-05-07T20:32:21.0437880Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0438192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0438818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0439156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0439492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0439787Z ) 2025-05-07T20:32:21.0440134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0440624Z def test_silu_mul_quant( 2025-05-07T20:32:21.0440864Z self, 2025-05-07T20:32:21.0441068Z T: int, 2025-05-07T20:32:21.0441269Z D: int, 2025-05-07T20:32:21.0441586Z scale_ub: Optional[float], 2025-05-07T20:32:21.0441865Z contiguous: bool, 2025-05-07T20:32:21.0442101Z compiled: bool, 2025-05-07T20:32:21.0442332Z ) -> None: 2025-05-07T20:32:21.0442552Z torch.manual_seed(2025) 2025-05-07T20:32:21.0442791Z 2025-05-07T20:32:21.0443073Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0445216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0447091Z 2025-05-07T20:32:21.0447220Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.0447431Z 2025-05-07T20:32:21.0447543Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0447950Z self=, 2025-05-07T20:32:21.0448356Z T=2048, 2025-05-07T20:32:21.0448554Z D=7168, 2025-05-07T20:32:21.0448760Z scale_ub=1200.0, 2025-05-07T20:32:21.0448986Z contiguous=True, 2025-05-07T20:32:21.0449215Z compiled=True, 2025-05-07T20:32:21.0449416Z ) 2025-05-07T20:32:21.0449738Z self = 2025-05-07T20:32:21.0450234Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.0450505Z 2025-05-07T20:32:21.0450589Z @given( 2025-05-07T20:32:21.0450814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.0451132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.0451513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.0451842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.0452176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.0452467Z ) 2025-05-07T20:32:21.0452811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.0453257Z def test_silu_mul_quant( 2025-05-07T20:32:21.0453502Z self, 2025-05-07T20:32:21.0453698Z T: int, 2025-05-07T20:32:21.0453893Z D: int, 2025-05-07T20:32:21.0454115Z scale_ub: Optional[float], 2025-05-07T20:32:21.0454388Z contiguous: bool, 2025-05-07T20:32:21.0454623Z compiled: bool, 2025-05-07T20:32:21.0454849Z ) -> None: 2025-05-07T20:32:21.0455066Z torch.manual_seed(2025) 2025-05-07T20:32:21.0455302Z 2025-05-07T20:32:21.0455642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.0455989Z 2025-05-07T20:32:21.0456238Z x_sign = torch.sign(x) 2025-05-07T20:32:21.0456536Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.0458522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
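The OutOfMemoryError sizes line up exactly with what the test allocates: each failed request is a single [T, 2*D] bfloat16 tensor, either the torch.randn input (activation_test.py:92) or a same-shaped intermediate from the sign/clamp lines (:94, :95), at 2 bytes per element. A quick check against the figures reported so far:

```python
def bf16_alloc_mib(T: int, D: int) -> float:
    # One [T, 2*D] bfloat16 tensor, 2 bytes per element, expressed in MiB.
    return T * (2 * D) * 2 / 2**20

assert bf16_alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
assert bf16_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert bf16_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
```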
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.0460416Z 2025-05-07T20:32:21.0460541Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.0460752Z 2025-05-07T20:32:21.0460862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0461275Z self=, 2025-05-07T20:32:21.0461760Z T=2048, 2025-05-07T20:32:21.0461956Z D=7168, 2025-05-07T20:32:21.0462146Z scale_ub=None, 2025-05-07T20:32:21.0462365Z contiguous=True, 2025-05-07T20:32:21.0462591Z compiled=False, 2025-05-07T20:32:21.0462794Z ) 2025-05-07T20:32:21.1326519Z self = 2025-05-07T20:32:21.1327247Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.1327525Z 2025-05-07T20:32:21.1327616Z @given( 2025-05-07T20:32:21.1327857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1328175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1328497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1328836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1329168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1329480Z ) 2025-05-07T20:32:21.1329850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1330307Z def test_silu_mul_quant( 2025-05-07T20:32:21.1330555Z self, 2025-05-07T20:32:21.1330758Z T: int, 2025-05-07T20:32:21.1330958Z D: int, 2025-05-07T20:32:21.1331182Z scale_ub: Optional[float], 2025-05-07T20:32:21.1331463Z contiguous: bool, 2025-05-07T20:32:21.1331705Z compiled: bool, 2025-05-07T20:32:21.1331936Z ) -> None: 2025-05-07T20:32:21.1332159Z torch.manual_seed(2025) 2025-05-07T20:32:21.1332418Z 2025-05-07T20:32:21.1332704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1333054Z 2025-05-07T20:32:21.1333253Z > x_sign = torch.sign(x) 2025-05-07T20:32:21.1335462Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.1337331Z 2025-05-07T20:32:21.1337452Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:21.1337676Z 2025-05-07T20:32:21.1337783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1338206Z self=, 2025-05-07T20:32:21.1338910Z T=1, 2025-05-07T20:32:21.1339104Z D=7168, 2025-05-07T20:32:21.1339308Z scale_ub=1200.0, 2025-05-07T20:32:21.1339534Z contiguous=True, 2025-05-07T20:32:21.1339770Z compiled=False, 2025-05-07T20:32:21.1339983Z ) 2025-05-07T20:32:21.1340434Z self = 2025-05-07T20:32:21.1341010Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.1341281Z 2025-05-07T20:32:21.1341374Z @given( 2025-05-07T20:32:21.1341609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1341930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1342244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1342581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1342911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1343204Z ) 2025-05-07T20:32:21.1343563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1344005Z def test_silu_mul_quant( 2025-05-07T20:32:21.1344250Z self, 2025-05-07T20:32:21.1344452Z T: int, 2025-05-07T20:32:21.1344648Z D: int, 2025-05-07T20:32:21.1344877Z scale_ub: Optional[float], 2025-05-07T20:32:21.1345154Z contiguous: bool, 2025-05-07T20:32:21.1345477Z compiled: bool, 2025-05-07T20:32:21.1345704Z ) -> None: 2025-05-07T20:32:21.1345931Z torch.manual_seed(2025) 2025-05-07T20:32:21.1346168Z 2025-05-07T20:32:21.1346449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1346801Z 2025-05-07T20:32:21.1347001Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1347293Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1347610Z x = x_sign * x_clamp 2025-05-07T20:32:21.1347854Z x0 = x[:, :D] 2025-05-07T20:32:21.1348071Z x1 = x[:, D:] 2025-05-07T20:32:21.1348286Z 2025-05-07T20:32:21.1348479Z if contiguous: 2025-05-07T20:32:21.1348714Z x0 = x0.contiguous() 2025-05-07T20:32:21.1348989Z x1 = x1.contiguous() 2025-05-07T20:32:21.1349241Z 2025-05-07T20:32:21.1349439Z if scale_ub is not None: 2025-05-07T20:32:21.1349724Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1350077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1350419Z ) 2025-05-07T20:32:21.1350645Z else: 2025-05-07T20:32:21.1350864Z scale_ub_tensor = None 2025-05-07T20:32:21.1351117Z 2025-05-07T20:32:21.1351360Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1351687Z op = silu_mul_quant 2025-05-07T20:32:21.1351947Z if compiled: 2025-05-07T20:32:21.1352202Z op = torch.compile(op) 2025-05-07T20:32:21.1352509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1352800Z 2025-05-07T20:32:21.1352996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1353174Z 2025-05-07T20:32:21.1353278Z moe/activation_test.py:117: 2025-05-07T20:32:21.1353582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1353925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1354291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1355001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1355706Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1356251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1356946Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1357619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1358156Z kernel = self.compile( 2025-05-07T20:32:21.1358713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1359382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1359835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1360109Z 2025-05-07T20:32:21.1360323Z self = 2025-05-07T20:32:21.1361413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1362781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1644680>} 2025-05-07T20:32:21.1364235Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1365273Z context = 2025-05-07T20:32:21.1365567Z 2025-05-07T20:32:21.1365747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1366324Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1366804Z module_map=module_map) 2025-05-07T20:32:21.1367171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1367535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1367799Z E ^ 2025-05-07T20:32:21.1368267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1368726Z 2025-05-07T20:32:21.1369150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1369676Z 2025-05-07T20:32:21.1369782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1370204Z self=, 2025-05-07T20:32:21.1370614Z T=128, 2025-05-07T20:32:21.1370813Z D=5120, 2025-05-07T20:32:21.1371016Z scale_ub=None, 2025-05-07T20:32:21.1371238Z contiguous=True, 2025-05-07T20:32:21.1371469Z compiled=False, 2025-05-07T20:32:21.1371683Z ) 2025-05-07T20:32:21.3721941Z self = 2025-05-07T20:32:21.3722788Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.3723202Z 2025-05-07T20:32:21.3723426Z @given( 2025-05-07T20:32:21.3723739Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.3724064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.3724393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.3724743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.3725081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.3725390Z ) 2025-05-07T20:32:21.3726009Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.3726487Z def test_silu_mul_quant( 2025-05-07T20:32:21.3726733Z self, 2025-05-07T20:32:21.3726937Z T: int, 2025-05-07T20:32:21.3727141Z D: int, 2025-05-07T20:32:21.3727359Z scale_ub: Optional[float], 2025-05-07T20:32:21.3727644Z contiguous: bool, 2025-05-07T20:32:21.3727893Z compiled: bool, 2025-05-07T20:32:21.3728118Z ) -> None: 2025-05-07T20:32:21.3728342Z torch.manual_seed(2025) 2025-05-07T20:32:21.3728594Z 2025-05-07T20:32:21.3728868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.3729217Z 2025-05-07T20:32:21.3729414Z x_sign = torch.sign(x) 2025-05-07T20:32:21.3729706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.3730021Z x = x_sign * x_clamp 2025-05-07T20:32:21.3730265Z x0 = x[:, :D] 2025-05-07T20:32:21.3730595Z x1 = x[:, D:] 2025-05-07T20:32:21.3730814Z 2025-05-07T20:32:21.3731074Z if contiguous: 2025-05-07T20:32:21.3731311Z x0 = x0.contiguous() 2025-05-07T20:32:21.3731586Z x1 = x1.contiguous() 2025-05-07T20:32:21.3731834Z 2025-05-07T20:32:21.3732033Z if scale_ub is not None: 2025-05-07T20:32:21.3732307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.3732652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.3732967Z ) 2025-05-07T20:32:21.3733159Z else: 2025-05-07T20:32:21.3733372Z scale_ub_tensor = None 2025-05-07T20:32:21.3733628Z 2025-05-07T20:32:21.3733864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.3734188Z op = silu_mul_quant 2025-05-07T20:32:21.3734445Z if compiled: 2025-05-07T20:32:21.3734694Z op = torch.compile(op) 2025-05-07T20:32:21.3734998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3735280Z 2025-05-07T20:32:21.3735478Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.3735729Z 2025-05-07T20:32:21.3735837Z moe/activation_test.py:117: 2025-05-07T20:32:21.3736143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3736486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.3736770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3737474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.3738178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.3739038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.3739742Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.3740461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.3741023Z kernel = self.compile( 2025-05-07T20:32:21.3741579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.3742248Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.3742653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3742884Z 2025-05-07T20:32:21.3743102Z self = 2025-05-07T20:32:21.3744185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.3745573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16458a0>} 2025-05-07T20:32:21.3747016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.3748061Z context = 2025-05-07T20:32:21.3748353Z 2025-05-07T20:32:21.3748526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.3749057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.3749530Z module_map=module_map) 2025-05-07T20:32:21.3749900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.3750254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.3750521Z E ^ 2025-05-07T20:32:21.3751058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.3751516Z 2025-05-07T20:32:21.3751996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.3752524Z 2025-05-07T20:32:21.3752631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.3753049Z self=, 2025-05-07T20:32:21.3753457Z T=128, 2025-05-07T20:32:21.3753647Z D=7168, 2025-05-07T20:32:21.3753848Z scale_ub=None, 2025-05-07T20:32:21.3754067Z contiguous=True, 2025-05-07T20:32:21.3754289Z compiled=False, 2025-05-07T20:32:21.3754500Z ) 2025-05-07T20:32:21.3754827Z self = 2025-05-07T20:32:21.3755319Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.3755595Z 2025-05-07T20:32:21.3755673Z @given( 2025-05-07T20:32:21.3755905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.3756224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.3756607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.3756943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.3757277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.3757564Z ) 2025-05-07T20:32:21.3757918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.3758367Z def test_silu_mul_quant( 2025-05-07T20:32:21.3758605Z self, 2025-05-07T20:32:21.3758807Z T: int, 2025-05-07T20:32:21.3766094Z D: int, 2025-05-07T20:32:21.3766435Z scale_ub: Optional[float], 2025-05-07T20:32:21.3766719Z contiguous: bool, 2025-05-07T20:32:21.3766961Z compiled: bool, 2025-05-07T20:32:21.3767193Z ) -> None: 2025-05-07T20:32:21.3767417Z torch.manual_seed(2025) 2025-05-07T20:32:21.3767658Z 2025-05-07T20:32:21.3767946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.3768298Z 2025-05-07T20:32:21.3768501Z x_sign = torch.sign(x) 2025-05-07T20:32:21.3768801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.3769121Z x = x_sign * x_clamp 2025-05-07T20:32:21.3769357Z x0 = x[:, :D] 2025-05-07T20:32:21.3769580Z x1 = x[:, D:] 2025-05-07T20:32:21.3769800Z 2025-05-07T20:32:21.3769985Z if contiguous: 2025-05-07T20:32:21.3770246Z x0 = x0.contiguous() 2025-05-07T20:32:21.3770547Z x1 = x1.contiguous() 2025-05-07T20:32:21.3770787Z 2025-05-07T20:32:21.3770988Z if scale_ub is not None: 2025-05-07T20:32:21.3771268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.3771612Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.3771920Z ) 2025-05-07T20:32:21.3772120Z else: 2025-05-07T20:32:21.3772341Z scale_ub_tensor = None 2025-05-07T20:32:21.3772592Z 2025-05-07T20:32:21.3772916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.3773245Z op = silu_mul_quant 2025-05-07T20:32:21.3773493Z if compiled: 2025-05-07T20:32:21.3773751Z op = torch.compile(op) 2025-05-07T20:32:21.3774054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3774332Z 2025-05-07T20:32:21.3774532Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.3774696Z 2025-05-07T20:32:21.3774848Z moe/activation_test.py:117: 2025-05-07T20:32:21.3775252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3775607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.3775896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.3776599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.3777294Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.3777901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.3778642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.3779318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.3779851Z kernel = self.compile( 2025-05-07T20:32:21.3780403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.3781067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.3781462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.3781699Z 2025-05-07T20:32:21.3781910Z self = 2025-05-07T20:32:21.3783018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.3784437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16467a0>} 2025-05-07T20:32:21.3785796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.3786814Z context = 2025-05-07T20:32:21.3787106Z 2025-05-07T20:32:21.3787280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.3787806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.3788276Z module_map=module_map) 2025-05-07T20:32:21.3788642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.3789008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.3789279Z E ^ 2025-05-07T20:32:21.3789740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.3790209Z 2025-05-07T20:32:21.3790632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.3791153Z 2025-05-07T20:32:21.3791255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.3791669Z self=, 2025-05-07T20:32:21.3792064Z T=2048, 2025-05-07T20:32:21.3792262Z D=7168, 2025-05-07T20:32:21.3792462Z scale_ub=1200.0, 2025-05-07T20:32:21.3792684Z contiguous=True, 2025-05-07T20:32:21.3792915Z compiled=False, 2025-05-07T20:32:21.3793131Z ) 2025-05-07T20:32:21.4459976Z self = 2025-05-07T20:32:21.4460805Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.4461198Z 2025-05-07T20:32:21.4461312Z @given( 2025-05-07T20:32:21.4461588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4461909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4462219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4462547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4462879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4463166Z ) 2025-05-07T20:32:21.4463513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4463958Z def test_silu_mul_quant( 2025-05-07T20:32:21.4464195Z self, 2025-05-07T20:32:21.4464395Z T: int, 2025-05-07T20:32:21.4464594Z D: int, 2025-05-07T20:32:21.4464881Z scale_ub: Optional[float], 2025-05-07T20:32:21.4465163Z contiguous: bool, 2025-05-07T20:32:21.4465465Z compiled: bool, 2025-05-07T20:32:21.4465693Z ) -> None: 2025-05-07T20:32:21.4465903Z torch.manual_seed(2025) 2025-05-07T20:32:21.4466145Z 2025-05-07T20:32:21.4466421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4468470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
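For readers without the FBGEMM sources open: judging only from the op name and from the test unpacking `y_fp8, y_scale = fn()`, silu_mul_quant fuses a SiLU-gated multiply with fp8 quantization under an optional scale upper bound. A plain-eager sketch of that assumed contract (an inference from the test, not FBGEMM's actual kernel; the rowwise scaling choice in particular is a guess):

```python
import torch
import torch.nn.functional as F

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Assumed contract: y = silu(x0) * x1, quantized rowwise to fp8e4m3.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the scale, as scale_ub does
    y_scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```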
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.4470330Z 2025-05-07T20:32:21.4470467Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.4470748Z 2025-05-07T20:32:21.4470850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4471261Z self=, 2025-05-07T20:32:21.4471665Z T=1, 2025-05-07T20:32:21.4471846Z D=5120, 2025-05-07T20:32:21.4472045Z scale_ub=1200.0, 2025-05-07T20:32:21.4472268Z contiguous=True, 2025-05-07T20:32:21.4472488Z compiled=False, 2025-05-07T20:32:21.4472694Z ) 2025-05-07T20:32:21.4473015Z self = 2025-05-07T20:32:21.4473501Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.4473769Z 2025-05-07T20:32:21.4473846Z @given( 2025-05-07T20:32:21.4474083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4474397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4474699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4475035Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4475369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4475651Z ) 2025-05-07T20:32:21.4476003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4476445Z def test_silu_mul_quant( 2025-05-07T20:32:21.4476695Z self, 2025-05-07T20:32:21.4476884Z T: int, 2025-05-07T20:32:21.4477085Z D: int, 2025-05-07T20:32:21.4477361Z scale_ub: Optional[float], 2025-05-07T20:32:21.4477706Z contiguous: bool, 2025-05-07T20:32:21.4477949Z compiled: bool, 2025-05-07T20:32:21.4478182Z ) -> None: 2025-05-07T20:32:21.4478394Z torch.manual_seed(2025) 2025-05-07T20:32:21.4478650Z 2025-05-07T20:32:21.4478941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4479279Z 2025-05-07T20:32:21.4479482Z x_sign = torch.sign(x) 2025-05-07T20:32:21.4479968Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.4480352Z x = x_sign * x_clamp 2025-05-07T20:32:21.4480596Z x0 = x[:, :D] 2025-05-07T20:32:21.4480816Z x1 = x[:, D:] 2025-05-07T20:32:21.4481026Z 2025-05-07T20:32:21.4481216Z if contiguous: 2025-05-07T20:32:21.4481457Z x0 = x0.contiguous() 2025-05-07T20:32:21.4481711Z x1 = x1.contiguous() 2025-05-07T20:32:21.4481961Z 2025-05-07T20:32:21.4482157Z if scale_ub is not None: 2025-05-07T20:32:21.4482433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.4482764Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.4483073Z ) 2025-05-07T20:32:21.4483269Z else: 2025-05-07T20:32:21.4483644Z scale_ub_tensor = None 2025-05-07T20:32:21.4483898Z 2025-05-07T20:32:21.4484132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.4484490Z op = silu_mul_quant 2025-05-07T20:32:21.4484748Z if compiled: 2025-05-07T20:32:21.4485051Z op = torch.compile(op) 2025-05-07T20:32:21.4485353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.4485639Z 2025-05-07T20:32:21.4485832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.4485995Z 2025-05-07T20:32:21.4486097Z moe/activation_test.py:117: 2025-05-07T20:32:21.4486393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.4486733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.4487014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.4487704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.4488596Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.4489242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.4489933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.4490670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.4491206Z kernel = self.compile( 2025-05-07T20:32:21.4491752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.4492407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.4492809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.4493036Z 2025-05-07T20:32:21.4493251Z self = 2025-05-07T20:32:21.4494332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.4495688Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1647b00>} 2025-05-07T20:32:21.4497034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.4498061Z context = 2025-05-07T20:32:21.4498349Z 2025-05-07T20:32:21.4498524Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.4499040Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.4499508Z module_map=module_map) 2025-05-07T20:32:21.4499877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.4500249Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.4500547Z E ^ 2025-05-07T20:32:21.4501074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.4501529Z 2025-05-07T20:32:21.4501957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.4502473Z 2025-05-07T20:32:21.4502587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4502992Z self=, 2025-05-07T20:32:21.4503397Z T=2048, 2025-05-07T20:32:21.4503587Z D=5120, 2025-05-07T20:32:21.4503775Z scale_ub=None, 2025-05-07T20:32:21.4503992Z contiguous=True, 2025-05-07T20:32:21.4504219Z compiled=False, 2025-05-07T20:32:21.4504420Z ) 2025-05-07T20:32:21.4504743Z self = 2025-05-07T20:32:21.4505281Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.4505617Z 2025-05-07T20:32:21.4505701Z @given( 2025-05-07T20:32:21.4505934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.4506248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.4506556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.4506877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.4507206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.4507487Z ) 2025-05-07T20:32:21.4507837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.4508282Z def test_silu_mul_quant( 2025-05-07T20:32:21.4508518Z self, 2025-05-07T20:32:21.4508721Z T: int, 2025-05-07T20:32:21.4508922Z D: int, 2025-05-07T20:32:21.4509141Z scale_ub: Optional[float], 2025-05-07T20:32:21.4509407Z contiguous: bool, 2025-05-07T20:32:21.4509652Z compiled: bool, 2025-05-07T20:32:21.4509879Z ) -> None: 2025-05-07T20:32:21.4510096Z torch.manual_seed(2025) 2025-05-07T20:32:21.4510388Z 2025-05-07T20:32:21.4510673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.4511060Z 2025-05-07T20:32:21.4511260Z > x_sign = torch.sign(x) 2025-05-07T20:32:21.4513205Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
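Note that torch.compile is not a factor here: the compiled=False examples above (T=1, T=128) fail in the same place, because silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel in eager mode too, per the traceback through moe/activation.py:80. A standalone repro outside Hypothesis, assuming the import path shown in that traceback:

```python
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn([1, 5120], device="cuda", dtype=torch.bfloat16)
x1 = torch.randn([1, 5120], device="cuda", dtype=torch.bfloat16)

# On a pre-sm_89 GPU this raises triton.compiler.errors.CompilationError:
#   ValueError("type fp8e4nv not supported in this architecture. ...")
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
```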
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.4515054Z 2025-05-07T20:32:21.4515183Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:21.4515396Z 2025-05-07T20:32:21.4515513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.4515928Z self=, 2025-05-07T20:32:21.4516334Z T=16384, 2025-05-07T20:32:21.4516530Z D=5120, 2025-05-07T20:32:21.4516718Z scale_ub=None, 2025-05-07T20:32:21.4516929Z contiguous=True, 2025-05-07T20:32:21.4517154Z compiled=False, 2025-05-07T20:32:21.4517353Z ) 2025-05-07T20:32:21.5218162Z self = 2025-05-07T20:32:21.5218934Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.5219320Z 2025-05-07T20:32:21.5219442Z @given( 2025-05-07T20:32:21.5219687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5220007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5220308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5220649Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5221101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5221399Z ) 2025-05-07T20:32:21.5221746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5222193Z def test_silu_mul_quant( 2025-05-07T20:32:21.5222438Z self, 2025-05-07T20:32:21.5222632Z T: int, 2025-05-07T20:32:21.5222834Z D: int, 2025-05-07T20:32:21.5223056Z scale_ub: Optional[float], 2025-05-07T20:32:21.5223329Z contiguous: bool, 2025-05-07T20:32:21.5223572Z compiled: bool, 2025-05-07T20:32:21.5223804Z ) -> None: 2025-05-07T20:32:21.5224016Z torch.manual_seed(2025) 2025-05-07T20:32:21.5224259Z 2025-05-07T20:32:21.5224540Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5226664Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5228572Z 2025-05-07T20:32:21.5228700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5228912Z 2025-05-07T20:32:21.5229017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5229431Z self=, 2025-05-07T20:32:21.5229838Z T=4096, 2025-05-07T20:32:21.5230025Z D=5120, 2025-05-07T20:32:21.5230222Z scale_ub=None, 2025-05-07T20:32:21.5230447Z contiguous=True, 2025-05-07T20:32:21.5230707Z compiled=False, 2025-05-07T20:32:21.5230918Z ) 2025-05-07T20:32:21.5231239Z self = 2025-05-07T20:32:21.5231802Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.5232080Z 2025-05-07T20:32:21.5232162Z @given( 2025-05-07T20:32:21.5232396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5232706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5233009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5233340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5233671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5233955Z ) 2025-05-07T20:32:21.5234307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5234755Z def test_silu_mul_quant( 2025-05-07T20:32:21.5234993Z self, 2025-05-07T20:32:21.5235194Z T: int, 2025-05-07T20:32:21.5235397Z D: int, 2025-05-07T20:32:21.5235612Z scale_ub: Optional[float], 2025-05-07T20:32:21.5235898Z contiguous: bool, 2025-05-07T20:32:21.5236142Z compiled: bool, 2025-05-07T20:32:21.5236371Z ) -> None: 2025-05-07T20:32:21.5236583Z torch.manual_seed(2025) 2025-05-07T20:32:21.5236832Z 2025-05-07T20:32:21.5237112Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5239504Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
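By this point the process holds 21.73 GiB of a 22.07 GiB card with roughly 26 MiB free, so even 40 MiB requests fail; each new Hypothesis example allocates fresh tensors while memory from earlier failed examples is still cached. Two mitigations, as sketches rather than verified fixes for this job: the allocator setting the error message itself recommends, which only takes effect if present before the process first touches CUDA (i.e. in the job's environment, not inside the test), and explicitly releasing cached blocks between examples:

```python
import gc
import os

import torch

# (1) Suggested by the error text; must be in the environment before the
#     first CUDA allocation, e.g. exported in the CI job rather than set here.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # (2) Drop dead tensors still referenced from Python, then hand the
    #     allocator's cached-but-unused blocks back to the driver.
    #     Could run in a TestCase tearDown() between examples (sketch).
    gc.collect()
    torch.cuda.empty_cache()
```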
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5241383Z 2025-05-07T20:32:21.5241592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5241823Z 2025-05-07T20:32:21.5241930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5242351Z self=, 2025-05-07T20:32:21.5242766Z T=2048, 2025-05-07T20:32:21.5242953Z D=5120, 2025-05-07T20:32:21.5243141Z scale_ub=None, 2025-05-07T20:32:21.5243482Z contiguous=False, 2025-05-07T20:32:21.5243709Z compiled=False, 2025-05-07T20:32:21.5243916Z ) 2025-05-07T20:32:21.5244240Z self = 2025-05-07T20:32:21.5244735Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.5245014Z 2025-05-07T20:32:21.5245092Z @given( 2025-05-07T20:32:21.5245327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5245642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5246010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5246395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5246726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5247006Z ) 2025-05-07T20:32:21.5247355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5247796Z def test_silu_mul_quant( 2025-05-07T20:32:21.5248032Z self, 2025-05-07T20:32:21.5248228Z T: int, 2025-05-07T20:32:21.5248428Z D: int, 2025-05-07T20:32:21.5248641Z scale_ub: Optional[float], 2025-05-07T20:32:21.5248918Z contiguous: bool, 2025-05-07T20:32:21.5249158Z compiled: bool, 2025-05-07T20:32:21.5249378Z ) -> None: 2025-05-07T20:32:21.5249594Z torch.manual_seed(2025) 2025-05-07T20:32:21.5249837Z 2025-05-07T20:32:21.5250109Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5252150Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5254068Z 2025-05-07T20:32:21.5254185Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5254401Z 2025-05-07T20:32:21.5254507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5254917Z self=, 2025-05-07T20:32:21.5255314Z T=4096, 2025-05-07T20:32:21.5255500Z D=7168, 2025-05-07T20:32:21.5255690Z scale_ub=None, 2025-05-07T20:32:21.5255897Z contiguous=True, 2025-05-07T20:32:21.5256124Z compiled=True, 2025-05-07T20:32:21.5256332Z ) 2025-05-07T20:32:21.5256647Z self = 2025-05-07T20:32:21.5257137Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.5257409Z 2025-05-07T20:32:21.5257486Z @given( 2025-05-07T20:32:21.5257713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5258024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5258328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5258655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5258973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5259261Z ) 2025-05-07T20:32:21.5259608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5260044Z def test_silu_mul_quant( 2025-05-07T20:32:21.5260303Z self, 2025-05-07T20:32:21.5260539Z T: int, 2025-05-07T20:32:21.5260739Z D: int, 2025-05-07T20:32:21.5261008Z scale_ub: Optional[float], 2025-05-07T20:32:21.5261282Z contiguous: bool, 2025-05-07T20:32:21.5261518Z compiled: bool, 2025-05-07T20:32:21.5261733Z ) -> None: 2025-05-07T20:32:21.5261949Z torch.manual_seed(2025) 2025-05-07T20:32:21.5262190Z 2025-05-07T20:32:21.5262455Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5264540Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5266441Z 2025-05-07T20:32:21.5266558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5266769Z 2025-05-07T20:32:21.5266879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5267288Z self=, 2025-05-07T20:32:21.5267683Z T=2048, 2025-05-07T20:32:21.5267870Z D=5120, 2025-05-07T20:32:21.5268061Z scale_ub=1200.0, 2025-05-07T20:32:21.5268277Z contiguous=False, 2025-05-07T20:32:21.5268502Z compiled=False, 2025-05-07T20:32:21.5268708Z ) 2025-05-07T20:32:21.5269020Z self = 2025-05-07T20:32:21.5269516Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.5269790Z 2025-05-07T20:32:21.5269871Z @given( 2025-05-07T20:32:21.5270099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5270418Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5270725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5271099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5271420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5271707Z ) 2025-05-07T20:32:21.5272056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5272494Z def test_silu_mul_quant( 2025-05-07T20:32:21.5272732Z self, 2025-05-07T20:32:21.5272928Z T: int, 2025-05-07T20:32:21.5273120Z D: int, 2025-05-07T20:32:21.5273341Z scale_ub: Optional[float], 2025-05-07T20:32:21.5273613Z contiguous: bool, 2025-05-07T20:32:21.5273847Z compiled: bool, 2025-05-07T20:32:21.5274073Z ) -> None: 2025-05-07T20:32:21.5274292Z torch.manual_seed(2025) 2025-05-07T20:32:21.5274529Z 2025-05-07T20:32:21.5274801Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5276843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.5278702Z 2025-05-07T20:32:21.5278821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.5279031Z 2025-05-07T20:32:21.5279137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5279544Z self=, 2025-05-07T20:32:21.5279948Z T=4096, 2025-05-07T20:32:21.5280138Z D=7168, 2025-05-07T20:32:21.5280329Z scale_ub=1200.0, 2025-05-07T20:32:21.5287266Z contiguous=True, 2025-05-07T20:32:21.5287538Z compiled=False, 2025-05-07T20:32:21.5287742Z ) 2025-05-07T20:32:21.6197425Z self = 2025-05-07T20:32:21.6198162Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.6198550Z 2025-05-07T20:32:21.6198660Z @given( 2025-05-07T20:32:21.6198961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6199388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6199801Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6200230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6200565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6200886Z ) 2025-05-07T20:32:21.6201259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6201823Z def test_silu_mul_quant( 2025-05-07T20:32:21.6202074Z self, 2025-05-07T20:32:21.6202342Z T: int, 2025-05-07T20:32:21.6202544Z D: int, 2025-05-07T20:32:21.6202761Z scale_ub: Optional[float], 2025-05-07T20:32:21.6203032Z contiguous: bool, 2025-05-07T20:32:21.6203269Z compiled: bool, 2025-05-07T20:32:21.6203627Z ) -> None: 2025-05-07T20:32:21.6203837Z torch.manual_seed(2025) 2025-05-07T20:32:21.6204077Z 2025-05-07T20:32:21.6204354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6206403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6208392Z 2025-05-07T20:32:21.6208510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6208724Z 2025-05-07T20:32:21.6208830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6209249Z self=, 2025-05-07T20:32:21.6209653Z T=16384, 2025-05-07T20:32:21.6209845Z D=7168, 2025-05-07T20:32:21.6210041Z scale_ub=None, 2025-05-07T20:32:21.6210259Z contiguous=False, 2025-05-07T20:32:21.6210477Z compiled=True, 2025-05-07T20:32:21.6210681Z ) 2025-05-07T20:32:21.6210994Z self = 2025-05-07T20:32:21.6211484Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.6211765Z 2025-05-07T20:32:21.6211841Z @given( 2025-05-07T20:32:21.6212070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6212386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6212693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6213021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6213342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6213621Z ) 2025-05-07T20:32:21.6213964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6214400Z def test_silu_mul_quant( 2025-05-07T20:32:21.6214633Z self, 2025-05-07T20:32:21.6214828Z T: int, 2025-05-07T20:32:21.6215026Z D: int, 2025-05-07T20:32:21.6215243Z scale_ub: Optional[float], 2025-05-07T20:32:21.6215513Z contiguous: bool, 2025-05-07T20:32:21.6215751Z compiled: bool, 2025-05-07T20:32:21.6215967Z ) -> None: 2025-05-07T20:32:21.6216176Z torch.manual_seed(2025) 2025-05-07T20:32:21.6216415Z 2025-05-07T20:32:21.6216752Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6218808Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6220667Z 2025-05-07T20:32:21.6220784Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6221005Z 2025-05-07T20:32:21.6221106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6221608Z self=, 2025-05-07T20:32:21.6222137Z T=4096, 2025-05-07T20:32:21.6222374Z D=7168, 2025-05-07T20:32:21.6222573Z scale_ub=None, 2025-05-07T20:32:21.6222785Z contiguous=True, 2025-05-07T20:32:21.6223013Z compiled=False, 2025-05-07T20:32:21.6223225Z ) 2025-05-07T20:32:21.6223537Z self = 2025-05-07T20:32:21.6224038Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.6224319Z 2025-05-07T20:32:21.6224403Z @given( 2025-05-07T20:32:21.6224633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6224937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6225243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6225571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6225894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6226191Z ) 2025-05-07T20:32:21.6226554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6226992Z def test_silu_mul_quant( 2025-05-07T20:32:21.6227290Z self, 2025-05-07T20:32:21.6227487Z T: int, 2025-05-07T20:32:21.6227679Z D: int, 2025-05-07T20:32:21.6227894Z scale_ub: Optional[float], 2025-05-07T20:32:21.6228163Z contiguous: bool, 2025-05-07T20:32:21.6228392Z compiled: bool, 2025-05-07T20:32:21.6228609Z ) -> None: 2025-05-07T20:32:21.6228820Z torch.manual_seed(2025) 2025-05-07T20:32:21.6229051Z 2025-05-07T20:32:21.6229322Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6231428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6233303Z 2025-05-07T20:32:21.6233422Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6233630Z 2025-05-07T20:32:21.6233735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6234137Z self=, 2025-05-07T20:32:21.6234550Z T=16384, 2025-05-07T20:32:21.6234747Z D=7168, 2025-05-07T20:32:21.6234936Z scale_ub=None, 2025-05-07T20:32:21.6235145Z contiguous=True, 2025-05-07T20:32:21.6235362Z compiled=False, 2025-05-07T20:32:21.6235555Z ) 2025-05-07T20:32:21.6235867Z self = 2025-05-07T20:32:21.6236360Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.6236637Z 2025-05-07T20:32:21.6236714Z @given( 2025-05-07T20:32:21.6236994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6237311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6237614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6237939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6238261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6238921Z ) 2025-05-07T20:32:21.6239268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6239711Z def test_silu_mul_quant( 2025-05-07T20:32:21.6239955Z self, 2025-05-07T20:32:21.6240142Z T: int, 2025-05-07T20:32:21.6240338Z D: int, 2025-05-07T20:32:21.6240555Z scale_ub: Optional[float], 2025-05-07T20:32:21.6240817Z contiguous: bool, 2025-05-07T20:32:21.6241051Z compiled: bool, 2025-05-07T20:32:21.6241272Z ) -> None: 2025-05-07T20:32:21.6241564Z torch.manual_seed(2025) 2025-05-07T20:32:21.6241861Z 2025-05-07T20:32:21.6242150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6244309Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6246186Z 2025-05-07T20:32:21.6246308Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6246517Z 2025-05-07T20:32:21.6246618Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6247026Z self=, 2025-05-07T20:32:21.6247500Z T=16384, 2025-05-07T20:32:21.6247684Z D=7168, 2025-05-07T20:32:21.6247869Z scale_ub=1200.0, 2025-05-07T20:32:21.6248086Z contiguous=True, 2025-05-07T20:32:21.6248298Z compiled=False, 2025-05-07T20:32:21.6248497Z ) 2025-05-07T20:32:21.6248808Z self = 2025-05-07T20:32:21.6249297Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.6249574Z 2025-05-07T20:32:21.6249645Z @given( 2025-05-07T20:32:21.6249867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.6250174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.6250467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.6250787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.6251110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.6251384Z ) 2025-05-07T20:32:21.6251735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.6252179Z def test_silu_mul_quant( 2025-05-07T20:32:21.6252419Z self, 2025-05-07T20:32:21.6252602Z T: int, 2025-05-07T20:32:21.6252794Z D: int, 2025-05-07T20:32:21.6253008Z scale_ub: Optional[float], 2025-05-07T20:32:21.6253267Z contiguous: bool, 2025-05-07T20:32:21.6253498Z compiled: bool, 2025-05-07T20:32:21.6253717Z ) -> None: 2025-05-07T20:32:21.6253921Z torch.manual_seed(2025) 2025-05-07T20:32:21.6254155Z 2025-05-07T20:32:21.6254421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.6256518Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
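[editor note] The `compiled` parameter only changes how the op is invoked: the test wraps the same op in torch.compile before calling it. A standalone sketch of that toggle, mirroring the test's fn() (the op and tensors here are illustrative stand-ins, not new API):

    import torch

    def call_op(op, x0, x1, scale_ub_tensor, compiled: bool):
        # When compiled=True, route the call through TorchDynamo/Inductor;
        # otherwise run the op eagerly, exactly as in the test's fn().
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)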
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.6258373Z 2025-05-07T20:32:21.6258488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.6258705Z 2025-05-07T20:32:21.6258805Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.6259211Z self=, 2025-05-07T20:32:21.6259608Z T=128, 2025-05-07T20:32:21.6259785Z D=5120, 2025-05-07T20:32:21.6259968Z scale_ub=1200.0, 2025-05-07T20:32:21.6260184Z contiguous=False, 2025-05-07T20:32:21.6260400Z compiled=False, 2025-05-07T20:32:21.6260600Z ) 2025-05-07T20:32:21.7275138Z self = 2025-05-07T20:32:21.7276914Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.7277858Z 2025-05-07T20:32:21.7278070Z @given( 2025-05-07T20:32:21.7278629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7279264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7279870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7280402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7280734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7281023Z ) 2025-05-07T20:32:21.7281371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7281817Z def test_silu_mul_quant( 2025-05-07T20:32:21.7282062Z self, 2025-05-07T20:32:21.7282254Z T: int, 2025-05-07T20:32:21.7282451Z D: int, 2025-05-07T20:32:21.7282670Z scale_ub: Optional[float], 2025-05-07T20:32:21.7282938Z contiguous: bool, 2025-05-07T20:32:21.7283181Z compiled: bool, 2025-05-07T20:32:21.7283567Z ) -> None: 2025-05-07T20:32:21.7283865Z torch.manual_seed(2025) 2025-05-07T20:32:21.7284108Z 2025-05-07T20:32:21.7284385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7284728Z 2025-05-07T20:32:21.7284918Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7285211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7285522Z x = x_sign * x_clamp 2025-05-07T20:32:21.7285756Z x0 = x[:, :D] 2025-05-07T20:32:21.7285978Z x1 = x[:, D:] 2025-05-07T20:32:21.7286188Z 2025-05-07T20:32:21.7286384Z if contiguous: 2025-05-07T20:32:21.7286626Z x0 = x0.contiguous() 2025-05-07T20:32:21.7286884Z x1 = x1.contiguous() 2025-05-07T20:32:21.7287127Z 2025-05-07T20:32:21.7287321Z if scale_ub is not None: 2025-05-07T20:32:21.7287596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.7287934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.7288249Z ) 2025-05-07T20:32:21.7288451Z else: 2025-05-07T20:32:21.7288659Z scale_ub_tensor = None 2025-05-07T20:32:21.7288917Z 2025-05-07T20:32:21.7289153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.7289464Z op = silu_mul_quant 2025-05-07T20:32:21.7289719Z if compiled: 2025-05-07T20:32:21.7289968Z op = torch.compile(op) 2025-05-07T20:32:21.7290261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7290540Z 2025-05-07T20:32:21.7290730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.7290900Z 2025-05-07T20:32:21.7291001Z moe/activation_test.py:117: 2025-05-07T20:32:21.7291295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7291626Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.7291909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7292674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.7293375Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.7293912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.7294591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.7295264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.7295805Z kernel = self.compile( 2025-05-07T20:32:21.7296353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.7297014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7297415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7297687Z 2025-05-07T20:32:21.7297908Z self = 2025-05-07T20:32:21.7299031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.7300393Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146e700>} 2025-05-07T20:32:21.7301734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.7302763Z context = 2025-05-07T20:32:21.7303053Z 2025-05-07T20:32:21.7303233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.7303757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7304268Z module_map=module_map) 2025-05-07T20:32:21.7304631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7304983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7305233Z E ^ 2025-05-07T20:32:21.7305701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7306155Z 2025-05-07T20:32:21.7306577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7307091Z 2025-05-07T20:32:21.7307198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7307606Z self=, 2025-05-07T20:32:21.7308011Z T=2048, 2025-05-07T20:32:21.7308201Z D=7168, 2025-05-07T20:32:21.7308393Z scale_ub=None, 2025-05-07T20:32:21.7308609Z contiguous=False, 2025-05-07T20:32:21.7308829Z compiled=False, 2025-05-07T20:32:21.7309032Z ) 2025-05-07T20:32:21.7309363Z self = 2025-05-07T20:32:21.7309856Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:21.7310129Z 2025-05-07T20:32:21.7310213Z @given( 2025-05-07T20:32:21.7310438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7310751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7311058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7311380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7311707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7311991Z ) 2025-05-07T20:32:21.7312337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7312779Z def test_silu_mul_quant( 2025-05-07T20:32:21.7313065Z self, 2025-05-07T20:32:21.7313261Z T: int, 2025-05-07T20:32:21.7313450Z D: int, 2025-05-07T20:32:21.7313669Z scale_ub: Optional[float], 2025-05-07T20:32:21.7313938Z contiguous: bool, 2025-05-07T20:32:21.7314171Z compiled: bool, 2025-05-07T20:32:21.7314389Z ) -> None: 2025-05-07T20:32:21.7314603Z torch.manual_seed(2025) 2025-05-07T20:32:21.7314841Z 2025-05-07T20:32:21.7315116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7317220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
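[editor note] Note the free-memory figure: the allocator reports only 26.44 MiB free out of 22.07 GiB, with over 21.7 GiB already held by PyTorch, so even small follow-up examples fail. One plausible mitigation, not something activation_test.py does as shown, is to release cached allocator blocks between generated examples:

    import gc
    import torch

    def free_cuda_memory() -> None:
        # Hypothetical helper: drop dead Python references, then return the
        # allocator's reserved-but-unallocated blocks to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Since Hypothesis runs every generated example inside a single test invocation, such cleanup would have to happen at the end of the test body itself rather than in setUp/tearDown.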
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7319116Z 2025-05-07T20:32:21.7319239Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.7319451Z 2025-05-07T20:32:21.7319557Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7319964Z self=, 2025-05-07T20:32:21.7320365Z T=128, 2025-05-07T20:32:21.7320552Z D=7168, 2025-05-07T20:32:21.7320735Z scale_ub=1200.0, 2025-05-07T20:32:21.7320958Z contiguous=True, 2025-05-07T20:32:21.7321181Z compiled=True, 2025-05-07T20:32:21.7321382Z ) 2025-05-07T20:32:21.7623793Z self = 2025-05-07T20:32:21.7624555Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.7624947Z 2025-05-07T20:32:21.7625082Z @given( 2025-05-07T20:32:21.7625386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7625975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7626296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7626624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7626947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7627230Z ) 2025-05-07T20:32:21.7627585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7628024Z def test_silu_mul_quant( 2025-05-07T20:32:21.7628265Z self, 2025-05-07T20:32:21.7628465Z T: int, 2025-05-07T20:32:21.7628660Z D: int, 2025-05-07T20:32:21.7628884Z scale_ub: Optional[float], 2025-05-07T20:32:21.7629158Z contiguous: bool, 2025-05-07T20:32:21.7629397Z compiled: bool, 2025-05-07T20:32:21.7629632Z ) -> None: 2025-05-07T20:32:21.7629849Z torch.manual_seed(2025) 2025-05-07T20:32:21.7630088Z 2025-05-07T20:32:21.7630367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7630717Z 2025-05-07T20:32:21.7630918Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7631210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7631525Z x = x_sign * x_clamp 2025-05-07T20:32:21.7631763Z x0 = x[:, :D] 2025-05-07T20:32:21.7631972Z x1 = x[:, D:] 2025-05-07T20:32:21.7632183Z 2025-05-07T20:32:21.7632370Z if contiguous: 2025-05-07T20:32:21.7632598Z x0 = x0.contiguous() 2025-05-07T20:32:21.7632855Z x1 = x1.contiguous() 2025-05-07T20:32:21.7633094Z 2025-05-07T20:32:21.7633282Z if scale_ub is not None: 2025-05-07T20:32:21.7633569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.7633914Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.7634216Z ) 2025-05-07T20:32:21.7634408Z else: 2025-05-07T20:32:21.7634624Z scale_ub_tensor = None 2025-05-07T20:32:21.7634953Z 2025-05-07T20:32:21.7635193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.7635515Z op = silu_mul_quant 2025-05-07T20:32:21.7635765Z if compiled: 2025-05-07T20:32:21.7636011Z op = torch.compile(op) 2025-05-07T20:32:21.7636305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7636577Z 2025-05-07T20:32:21.7636763Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.7636927Z 2025-05-07T20:32:21.7637027Z moe/activation_test.py:117: 2025-05-07T20:32:21.7637323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7637651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.7637935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.7638749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.7639401Z return fn(*args, **kwargs) 
2025-05-07T20:32:21.7640129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.7640832Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.7641370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.7642052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.7642723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.7643266Z kernel = self.compile( 2025-05-07T20:32:21.7643951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.7644616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7645017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.7645249Z 2025-05-07T20:32:21.7645533Z self = 2025-05-07T20:32:21.7646610Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.7647972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146ff60>} 2025-05-07T20:32:21.7649333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.7650502Z context = 2025-05-07T20:32:21.7650792Z 2025-05-07T20:32:21.7650964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.7651493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7651956Z module_map=module_map) 2025-05-07T20:32:21.7652314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7652665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7652917Z E ^ 2025-05-07T20:32:21.7653377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7653825Z 2025-05-07T20:32:21.7654248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7654763Z 2025-05-07T20:32:21.7654893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7655301Z self=, 2025-05-07T20:32:21.7655698Z T=128, 2025-05-07T20:32:21.7655976Z D=7168, 2025-05-07T20:32:21.7656172Z scale_ub=1200.0, 2025-05-07T20:32:21.7656392Z contiguous=True, 2025-05-07T20:32:21.7656612Z compiled=False, 2025-05-07T20:32:21.7656813Z ) 2025-05-07T20:32:21.7657122Z self = 2025-05-07T20:32:21.7657614Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.7657883Z 2025-05-07T20:32:21.7657965Z @given( 2025-05-07T20:32:21.7658187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7658500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7658800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7659122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7659448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7659734Z ) 2025-05-07T20:32:21.7660162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7660687Z def test_silu_mul_quant( 2025-05-07T20:32:21.7660923Z self, 2025-05-07T20:32:21.7661113Z T: int, 2025-05-07T20:32:21.7661300Z D: int, 2025-05-07T20:32:21.7661515Z scale_ub: Optional[float], 2025-05-07T20:32:21.7661782Z contiguous: bool, 2025-05-07T20:32:21.7662015Z compiled: bool, 2025-05-07T20:32:21.7662235Z ) -> None: 2025-05-07T20:32:21.7662445Z torch.manual_seed(2025) 2025-05-07T20:32:21.7662676Z 2025-05-07T20:32:21.7662947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7663288Z 2025-05-07T20:32:21.7669693Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7670022Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7672035Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
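[editor note] The CompilationError above is an architecture limit rather than a bug in the example: Triton only exposes the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer, and the 22 GiB capacity reported here is consistent with the A10G (sm_86) in linux.g5.4xlarge.nvidia.gpu runners. A hypothetical guard, not present in the test as shown, would skip the fp8 path on older parts:

    import unittest
    import torch

    # Assumption: Triton's fp8e4nv needs compute capability >= (8, 9) (Ada/Hopper).
    _HAS_FP8 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTestsSketch(unittest.TestCase):
        @unittest.skipIf(not _HAS_FP8, "Triton fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...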
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7673978Z 2025-05-07T20:32:21.7674102Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.7674316Z 2025-05-07T20:32:21.7674420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7674836Z self=, 2025-05-07T20:32:21.7675243Z T=128, 2025-05-07T20:32:21.7675428Z D=5120, 2025-05-07T20:32:21.7675619Z scale_ub=1200.0, 2025-05-07T20:32:21.7675842Z contiguous=True, 2025-05-07T20:32:21.7676056Z compiled=True, 2025-05-07T20:32:21.7676260Z ) 2025-05-07T20:32:21.7676583Z self = 2025-05-07T20:32:21.7677072Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.7677345Z 2025-05-07T20:32:21.7677422Z @given( 2025-05-07T20:32:21.7677645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.7677948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.7678248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.7678579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.7678904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.7679181Z ) 2025-05-07T20:32:21.7679525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.7679959Z def test_silu_mul_quant( 2025-05-07T20:32:21.7680193Z self, 2025-05-07T20:32:21.7680385Z T: int, 2025-05-07T20:32:21.7680585Z D: int, 2025-05-07T20:32:21.7680799Z scale_ub: Optional[float], 2025-05-07T20:32:21.7681120Z contiguous: bool, 2025-05-07T20:32:21.7681367Z compiled: bool, 2025-05-07T20:32:21.7681585Z ) -> None: 2025-05-07T20:32:21.7681799Z torch.manual_seed(2025) 2025-05-07T20:32:21.7682044Z 2025-05-07T20:32:21.7682312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.7682653Z 2025-05-07T20:32:21.7682843Z x_sign = torch.sign(x) 2025-05-07T20:32:21.7683125Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.7685343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.7687231Z 2025-05-07T20:32:21.7687347Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:21.7687557Z 2025-05-07T20:32:21.7687659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7688065Z self=, 2025-05-07T20:32:21.7688457Z T=128, 2025-05-07T20:32:21.7688640Z D=7168, 2025-05-07T20:32:21.7688822Z scale_ub=None, 2025-05-07T20:32:21.7689022Z contiguous=True, 2025-05-07T20:32:21.7689241Z compiled=True, 2025-05-07T20:32:21.7689438Z ) 2025-05-07T20:32:21.9743486Z self = 2025-05-07T20:32:21.9744896Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.9745632Z 2025-05-07T20:32:21.9745850Z @given( 2025-05-07T20:32:21.9746322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9746966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9747804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9748450Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9749101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9749667Z ) 2025-05-07T20:32:21.9750360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9750826Z def test_silu_mul_quant( 2025-05-07T20:32:21.9751070Z self, 2025-05-07T20:32:21.9751267Z T: int, 2025-05-07T20:32:21.9751462Z D: int, 2025-05-07T20:32:21.9751684Z scale_ub: Optional[float], 2025-05-07T20:32:21.9751964Z contiguous: bool, 2025-05-07T20:32:21.9752201Z compiled: bool, 2025-05-07T20:32:21.9752432Z ) -> None: 2025-05-07T20:32:21.9752652Z torch.manual_seed(2025) 2025-05-07T20:32:21.9752888Z 2025-05-07T20:32:21.9753169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9755225Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
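[editor note] The allocator message repeatedly suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That setting must be in the environment before the process makes its first CUDA allocation; one way to arrange that in a driver script (a sketch, not part of this workflow):

    import os

    # Must be set before torch initializes its CUDA caching allocator.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var so the setting takes effect

    x = torch.randn(1024, device="cuda")  # first allocation uses expandable segments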
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9757075Z 2025-05-07T20:32:21.9757194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:21.9757405Z 2025-05-07T20:32:21.9762927Z FAILED 2025-05-07T20:32:21.9763264Z 2025-05-07T20:32:21.9763662Z =================================== FAILURES =================================== 2025-05-07T20:32:21.9764449Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:21.9765087Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:21.9765928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:21.9766685Z | yield 2025-05-07T20:32:21.9767294Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:21.9768021Z | self._callTestMethod(testMethod) 2025-05-07T20:32:21.9768802Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:21.9769561Z | if method() is not None: 2025-05-07T20:32:21.9769898Z | ^^^^^^^^ 2025-05-07T20:32:21.9770929Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:21.9772022Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9772505Z | ^^^^^^^ 2025-05-07T20:32:21.9773280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:21.9774156Z | raise the_error_hypothesis_found 2025-05-07T20:32:21.9774737Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:21.9775303Z +-+---------------- 1 ---------------- 2025-05-07T20:32:21.9775713Z | Traceback (most recent call last): 2025-05-07T20:32:21.9776684Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9777745Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9778246Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9781016Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9783820Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9784483Z | self=, 2025-05-07T20:32:21.9785033Z | T=2048, 2025-05-07T20:32:21.9785357Z | D=5120, # or any other generated value 2025-05-07T20:32:21.9785844Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:21.9786350Z | contiguous=True, # or any other generated value 2025-05-07T20:32:21.9786858Z | compiled=False, # or any other generated value 2025-05-07T20:32:21.9787310Z | ) 2025-05-07T20:32:21.9787578Z | 2025-05-07T20:32:21.9788290Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:21.9789123Z +---------------- 2 ---------------- 2025-05-07T20:32:21.9789525Z | Traceback (most recent call last): 2025-05-07T20:32:21.9790528Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9791607Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9792103Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9794898Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9797618Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9798249Z | self=, 2025-05-07T20:32:21.9798798Z | T=128, 2025-05-07T20:32:21.9799072Z | D=7168, 2025-05-07T20:32:21.9799359Z | scale_ub=None, 2025-05-07T20:32:21.9799696Z | contiguous=True, 2025-05-07T20:32:21.9800016Z | compiled=True, 2025-05-07T20:32:21.9800323Z | ) 2025-05-07T20:32:21.9800575Z | 2025-05-07T20:32:21.9801377Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9802286Z +---------------- 3 ---------------- 2025-05-07T20:32:21.9802678Z | Traceback (most recent call last): 2025-05-07T20:32:21.9803783Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:21.9804865Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9805374Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9807545Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
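[editor note] Each sub-failure ends with a reproduction hint, and applying it is mechanical: stack the decorator on top of the existing @given test and Hypothesis replays exactly that example. The payload is version-specific (here 6.131.14) and only decodes against the original strategies, so it belongs on the real test unchanged. A minimal sketch using the blob from failure 1, with the strategies copied from the log:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # original body from activation_test.py; remove the decorator once fixed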
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:21.9809561Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9809994Z | self=, 2025-05-07T20:32:21.9810396Z | T=128, 2025-05-07T20:32:21.9810606Z | D=5120, 2025-05-07T20:32:21.9810898Z | scale_ub=1200.0, 2025-05-07T20:32:21.9811237Z | contiguous=True, 2025-05-07T20:32:21.9811579Z | compiled=True, 2025-05-07T20:32:21.9811889Z | ) 2025-05-07T20:32:21.9812138Z | 2025-05-07T20:32:21.9812881Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9813745Z +---------------- 4 ---------------- 2025-05-07T20:32:21.9814164Z | Traceback (most recent call last): 2025-05-07T20:32:21.9815203Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:21.9816247Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9816660Z | ^^^^^^^^ 2025-05-07T20:32:21.9817570Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:21.9818277Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9818609Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9819426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:21.9820227Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9820988Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:21.9822040Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9822649Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9823496Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:21.9824556Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9825217Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9826161Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:21.9827370Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9828086Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9828980Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:21.9829940Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9830451Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9831286Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:21.9832082Z | fn() 2025-05-07T20:32:21.9832891Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:21.9833752Z | self.fn.run( 2025-05-07T20:32:21.9834483Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:21.9835379Z | kernel = self.compile( 2025-05-07T20:32:21.9835752Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:21.9836604Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:21.9837586Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9838115Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9839311Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.9840453Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9841131Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.9841656Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9842149Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.9842508Z | ^ 2025-05-07T20:32:21.9843149Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9844154Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:21.9844691Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:21.9845409Z | self=, 2025-05-07T20:32:21.9846009Z | T=1, # or any other generated value 2025-05-07T20:32:21.9846432Z | D=5120, # or any other generated value 2025-05-07T20:32:21.9846884Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:21.9847387Z | contiguous=True, # or any other generated value 2025-05-07T20:32:21.9847913Z | compiled=True, # or any other generated value 2025-05-07T20:32:21.9848513Z | ) 2025-05-07T20:32:21.9848766Z | 2025-05-07T20:32:21.9849515Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:21.9850374Z +------------------------------------ 2025-05-07T20:32:21.9850855Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:21.9851376Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9851948Z self=, 2025-05-07T20:32:21.9852480Z T=1, 2025-05-07T20:32:21.9852734Z D=5120, 2025-05-07T20:32:21.9853001Z scale_ub=None, 2025-05-07T20:32:21.9853292Z contiguous=True, 2025-05-07T20:32:21.9853604Z compiled=True, 2025-05-07T20:32:21.9853895Z ) 2025-05-07T20:32:21.9854333Z self = 2025-05-07T20:32:21.9855099Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:21.9855545Z 2025-05-07T20:32:21.9855654Z @given( 2025-05-07T20:32:21.9855976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9856397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9856823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9857292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9857744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9858142Z ) 2025-05-07T20:32:21.9858627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9859234Z def test_silu_mul_quant( 2025-05-07T20:32:21.9859567Z self, 2025-05-07T20:32:21.9859837Z T: int, 2025-05-07T20:32:21.9860098Z D: int, 2025-05-07T20:32:21.9860396Z scale_ub: Optional[float], 2025-05-07T20:32:21.9860754Z contiguous: 
bool, 2025-05-07T20:32:21.9861076Z compiled: bool, 2025-05-07T20:32:21.9861393Z ) -> None: 2025-05-07T20:32:21.9861800Z torch.manual_seed(2025) 2025-05-07T20:32:21.9862136Z 2025-05-07T20:32:21.9862520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9863005Z 2025-05-07T20:32:21.9863280Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9863678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9864118Z x = x_sign * x_clamp 2025-05-07T20:32:21.9864452Z x0 = x[:, :D] 2025-05-07T20:32:21.9864720Z x1 = x[:, D:] 2025-05-07T20:32:21.9865018Z 2025-05-07T20:32:21.9865278Z if contiguous: 2025-05-07T20:32:21.9865604Z x0 = x0.contiguous() 2025-05-07T20:32:21.9865964Z x1 = x1.contiguous() 2025-05-07T20:32:21.9866304Z 2025-05-07T20:32:21.9866568Z if scale_ub is not None: 2025-05-07T20:32:21.9866965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9867439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9867875Z ) 2025-05-07T20:32:21.9868147Z else: 2025-05-07T20:32:21.9868443Z scale_ub_tensor = None 2025-05-07T20:32:21.9868803Z 2025-05-07T20:32:21.9869127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9869557Z op = silu_mul_quant 2025-05-07T20:32:21.9869912Z if compiled: 2025-05-07T20:32:21.9870255Z op = torch.compile(op) 2025-05-07T20:32:21.9870662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9871049Z 2025-05-07T20:32:21.9871310Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.9871709Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.9872120Z 2025-05-07T20:32:21.9872444Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9872910Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.9873327Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.9873817Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.9874322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9874748Z 2025-05-07T20:32:21.9875028Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9875294Z 2025-05-07T20:32:21.9875430Z moe/activation_test.py:126: 2025-05-07T20:32:21.9875842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9876313Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.9876765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9877879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.9878957Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9879735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9880702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9881686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.9882690Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9883892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.9884951Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9885983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.9886892Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9887740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.9888460Z fn() 2025-05-07T20:32:21.9889212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.9890007Z self.fn.run( 2025-05-07T20:32:21.9890685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.9891386Z kernel = self.compile( 2025-05-07T20:32:21.9892106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.9892966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9893522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9893859Z 2025-05-07T20:32:21.9894145Z self = 2025-05-07T20:32:21.9895666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.9897623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cda813a0>} 2025-05-07T20:32:21.9899409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.9900830Z context = 2025-05-07T20:32:21.9901209Z 2025-05-07T20:32:21.9901432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.9902128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9902745Z module_map=module_map) 2025-05-07T20:32:21.9903312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9903790Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.9904128Z E ^ 2025-05-07T20:32:21.9904739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9905335Z 2025-05-07T20:32:21.9905893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.9906582Z 2025-05-07T20:32:21.9906724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9907288Z self=, 2025-05-07T20:32:21.9907837Z T=2048, 2025-05-07T20:32:21.9908086Z D=5120, 2025-05-07T20:32:21.9908333Z scale_ub=1200.0, 2025-05-07T20:32:21.9908628Z contiguous=True, 2025-05-07T20:32:21.9908939Z compiled=False, 2025-05-07T20:32:21.9909262Z ) 2025-05-07T20:32:21.9909714Z self = 2025-05-07T20:32:21.9910473Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:21.9910861Z 2025-05-07T20:32:21.9910974Z @given( 2025-05-07T20:32:21.9911271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9911697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9912101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9912535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9912974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9913363Z ) 2025-05-07T20:32:21.9913818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9914413Z def test_silu_mul_quant( 2025-05-07T20:32:21.9914735Z self, 2025-05-07T20:32:21.9914991Z T: int, 2025-05-07T20:32:21.9915256Z D: int, 2025-05-07T20:32:21.9915545Z scale_ub: Optional[float], 2025-05-07T20:32:21.9915983Z contiguous: bool, 2025-05-07T20:32:21.9937593Z compiled: bool, 2025-05-07T20:32:21.9937917Z ) -> None: 2025-05-07T20:32:21.9938220Z torch.manual_seed(2025) 2025-05-07T20:32:21.9938823Z 2025-05-07T20:32:21.9939203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9939660Z 2025-05-07T20:32:21.9939936Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9940338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9940769Z x = x_sign * x_clamp 2025-05-07T20:32:21.9941090Z x0 = x[:, :D] 2025-05-07T20:32:21.9941378Z x1 = x[:, D:] 2025-05-07T20:32:21.9941660Z 2025-05-07T20:32:21.9941911Z if contiguous: 2025-05-07T20:32:21.9942223Z x0 = x0.contiguous() 2025-05-07T20:32:21.9942563Z x1 = x1.contiguous() 2025-05-07T20:32:21.9942895Z 2025-05-07T20:32:21.9943168Z if scale_ub is not None: 2025-05-07T20:32:21.9943536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9944000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9944411Z ) 2025-05-07T20:32:21.9944663Z else: 2025-05-07T20:32:21.9944963Z scale_ub_tensor = None 2025-05-07T20:32:21.9945316Z 2025-05-07T20:32:21.9945634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9946083Z op = silu_mul_quant 2025-05-07T20:32:21.9946436Z if compiled: 2025-05-07T20:32:21.9946764Z op = torch.compile(op) 2025-05-07T20:32:21.9947164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9947545Z 2025-05-07T20:32:21.9947791Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.9948019Z 2025-05-07T20:32:21.9948157Z moe/activation_test.py:117: 2025-05-07T20:32:21.9948577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9949050Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.9949631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9950599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.9951555Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.9952292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9953258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9954200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.9954970Z kernel = self.compile( 2025-05-07T20:32:21.9955733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.9956765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.9957345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9957775Z 2025-05-07T20:32:21.9958068Z self = 2025-05-07T20:32:21.9959560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.9961518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18cd7382c0>} 2025-05-07T20:32:21.9963456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.9964919Z context = 2025-05-07T20:32:21.9965309Z 2025-05-07T20:32:21.9965629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.9966332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.9966968Z module_map=module_map) 2025-05-07T20:32:21.9967493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.9967983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.9968347Z E ^ 2025-05-07T20:32:21.9968993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.9969616Z 2025-05-07T20:32:21.9970191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.9970938Z 2025-05-07T20:32:21.9971078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.9971647Z self=, 2025-05-07T20:32:21.9972188Z T=2048, 2025-05-07T20:32:21.9972435Z D=5120, 2025-05-07T20:32:21.9972693Z scale_ub=1200.0, 2025-05-07T20:32:21.9972999Z contiguous=True, 2025-05-07T20:32:21.9973293Z compiled=True, 2025-05-07T20:32:21.9973572Z ) 2025-05-07T20:32:21.9974024Z self = 2025-05-07T20:32:21.9974704Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.9975091Z 2025-05-07T20:32:21.9975199Z @given( 2025-05-07T20:32:21.9975522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9975975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9976375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9976822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9977176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9977460Z ) 2025-05-07T20:32:21.9977878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9978335Z def test_silu_mul_quant( 2025-05-07T20:32:21.9978575Z self, 2025-05-07T20:32:21.9978778Z T: int, 2025-05-07T20:32:21.9978987Z D: int, 2025-05-07T20:32:21.9979201Z scale_ub: Optional[float], 2025-05-07T20:32:21.9979478Z contiguous: bool, 2025-05-07T20:32:21.9979721Z compiled: bool, 2025-05-07T20:32:21.9979941Z ) -> None: 2025-05-07T20:32:21.9980162Z torch.manual_seed(2025) 2025-05-07T20:32:21.9980406Z 2025-05-07T20:32:21.9980671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9981017Z 2025-05-07T20:32:21.9981212Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9981499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9981809Z x = x_sign * x_clamp 2025-05-07T20:32:21.9982099Z x0 = x[:, :D] 2025-05-07T20:32:21.9982314Z x1 = x[:, D:] 2025-05-07T20:32:21.9982576Z 2025-05-07T20:32:21.9982764Z if contiguous: 2025-05-07T20:32:21.9982991Z x0 = x0.contiguous() 2025-05-07T20:32:21.9983251Z x1 = x1.contiguous() 2025-05-07T20:32:21.9983494Z 2025-05-07T20:32:21.9983679Z if scale_ub is not None: 2025-05-07T20:32:21.9983954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9984291Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9984605Z ) 2025-05-07T20:32:21.9984796Z else: 2025-05-07T20:32:21.9985006Z scale_ub_tensor = None 2025-05-07T20:32:21.9985261Z 2025-05-07T20:32:21.9985486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9985798Z op = silu_mul_quant 2025-05-07T20:32:21.9986047Z if compiled: 2025-05-07T20:32:21.9986288Z op = torch.compile(op) 2025-05-07T20:32:21.9986590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.9986870Z 2025-05-07T20:32:21.9987108Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.9987394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.9987682Z 2025-05-07T20:32:21.9987918Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9988256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.9988548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.9988866Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.9989223Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9989537Z 2025-05-07T20:32:21.9989737Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.9989931Z 2025-05-07T20:32:21.9990033Z moe/activation_test.py:126: 2025-05-07T20:32:21.9990330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.9990666Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.9990994Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.9991797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.9992552Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.9993097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.9993775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.9994469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.9995195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9995952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.9996746Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.9997487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.9998128Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.9998731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.9999246Z fn() 2025-05-07T20:32:21.9999755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0000349Z self.fn.run( 2025-05-07T20:32:22.0000852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0001387Z kernel = self.compile( 2025-05-07T20:32:22.0001975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0002666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0003067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0003405Z 2025-05-07T20:32:22.0003612Z self = 2025-05-07T20:32:22.0004692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0006059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18cd739440>} 2025-05-07T20:32:22.0007399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0008472Z context = 2025-05-07T20:32:22.0008766Z 2025-05-07T20:32:22.0008933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0009456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0009915Z module_map=module_map) 2025-05-07T20:32:22.0010282Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0010640Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0010900Z E ^ 2025-05-07T20:32:22.0011364Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0011823Z 2025-05-07T20:32:22.0012244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0012758Z 2025-05-07T20:32:22.0012872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0013282Z self=, 2025-05-07T20:32:22.0013686Z T=16384, 2025-05-07T20:32:22.0013878Z D=7168, 2025-05-07T20:32:22.0014063Z scale_ub=1200.0, 2025-05-07T20:32:22.0014288Z contiguous=False, 2025-05-07T20:32:22.0014513Z compiled=False, 2025-05-07T20:32:22.0014712Z ) 2025-05-07T20:32:22.0015032Z self = 2025-05-07T20:32:22.0015536Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0015815Z 2025-05-07T20:32:22.0015901Z @given( 2025-05-07T20:32:22.0016124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0016438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0016744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0017071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0017448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0017744Z ) 2025-05-07T20:32:22.0018086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0018530Z def test_silu_mul_quant( 2025-05-07T20:32:22.0018770Z self, 2025-05-07T20:32:22.0018961Z T: int, 2025-05-07T20:32:22.0019153Z D: int, 2025-05-07T20:32:22.0019370Z scale_ub: Optional[float], 2025-05-07T20:32:22.0019637Z contiguous: bool, 2025-05-07T20:32:22.0019871Z compiled: bool, 2025-05-07T20:32:22.0020092Z ) -> None: 2025-05-07T20:32:22.0020302Z torch.manual_seed(2025) 2025-05-07T20:32:22.0020540Z 2025-05-07T20:32:22.0020812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0021155Z 2025-05-07T20:32:22.0021342Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0021682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0022056Z x = x_sign * x_clamp 2025-05-07T20:32:22.0022304Z x0 = x[:, :D] 2025-05-07T20:32:22.0022522Z x1 = x[:, D:] 2025-05-07T20:32:22.0022730Z 2025-05-07T20:32:22.0022910Z if contiguous: 2025-05-07T20:32:22.0023148Z x0 = x0.contiguous() 2025-05-07T20:32:22.0023403Z x1 = x1.contiguous() 2025-05-07T20:32:22.0023638Z 2025-05-07T20:32:22.0023834Z if scale_ub is not None: 2025-05-07T20:32:22.0024108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0024439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0024748Z ) 2025-05-07T20:32:22.0024939Z else: 2025-05-07T20:32:22.0025146Z scale_ub_tensor = None 2025-05-07T20:32:22.0025401Z 2025-05-07T20:32:22.0025633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0025939Z op = silu_mul_quant 2025-05-07T20:32:22.0026192Z if compiled: 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f18cc82e660>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
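In the examples where fn() succeeds, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None] and compares against ref_fn(), so triton_quantize_fp8_row is expected to return a row-quantized fp8 tensor plus a per-row multiplier that restores the original scale. A minimal pure-PyTorch sketch of that contract, assuming scale_ub caps the per-row max; quantize_fp8_row_ref is an illustrative name, not the FBGEMM implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max picks the scale so each row spans the fp8 range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        # Guard all-zero rows against division by zero.
        scale = torch.where(row_max > 0, row_max / fp8_max, torch.ones_like(row_max))
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale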
Hypothesis keeps drawing examples, and every one fails with the identical fp8e4nv CompilationError; only the sampled parameters and the first kernel reached differ:

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fn() succeeds; ref_fn() raises CompilationError at moe/activation_test.py:126, from _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
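The "Trying example:" lines are produced by Hypothesis: @settings(verbosity=Verbosity.verbose) prints every drawn example, and st.sampled_from redraws from the fixed candidate lists until max_examples is exhausted, which is why the same traceback repeats under different (T, D, scale_ub, contiguous, compiled) tuples. A standalone sketch of that pattern, assuming only that hypothesis is installed:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, compiled: bool) -> None:
        # Verbose verbosity prints each drawn pair as "Trying example: ...".
        assert T > 0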
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fn() succeeds; ref_fn() raises CompilationError at moe/activation_test.py:126, from _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() raises CompilationError at moe/activation_test.py:117, from _fbgemm_silu_mul_quant
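In the ref_fn() failures the error surfaces under triton/runtime/autotuner.py because _kernel_quantize_fp8_row is an autotuned kernel: Triton benchmarks each candidate config through do_bench, and the very first benchmark triggers the compile that raises. A minimal sketch of that autotune structure, with an illustrative kernel (not the FBGEMM one):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
        key=["n"],
    )
    @triton.jit
    def _double_kernel(x_ptr, n, BLOCK: tl.constexpr):
        # Each config is compiled and timed inside the autotuner's _bench call.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(x_ptr + offs, x * 2.0, mask=mask)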
y_scale_ref = ref_fn() 2025-05-07T20:32:22.0197151Z 2025-05-07T20:32:22.0197250Z moe/activation_test.py:126: 2025-05-07T20:32:22.0197375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0197487Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0197624Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0198202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0198301Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0198663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0198887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0199256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0199515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0199916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0200219Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0200614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0200778Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0201122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0201203Z fn() 2025-05-07T20:32:22.0201607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0201693Z self.fn.run( 2025-05-07T20:32:22.0202033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0202124Z kernel = self.compile( 2025-05-07T20:32:22.0202552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0202770Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0202897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0202907Z 2025-05-07T20:32:22.0203113Z self = 2025-05-07T20:32:22.0203962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0204466Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f18a3a8afc0>} 2025-05-07T20:32:22.0205224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0205467Z context = 2025-05-07T20:32:22.0205472Z 2025-05-07T20:32:22.0205636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0205900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0206008Z module_map=module_map) 2025-05-07T20:32:22.0206170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0206274Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0206350Z E ^ 2025-05-07T20:32:22.0206706Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0206710Z 2025-05-07T20:32:22.0207135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0207142Z 2025-05-07T20:32:22.0207251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0207479Z self=, 2025-05-07T20:32:22.0207554Z T=2048, 2025-05-07T20:32:22.0207628Z D=5120, 2025-05-07T20:32:22.0207716Z scale_ub=None, 2025-05-07T20:32:22.0207801Z contiguous=True, 2025-05-07T20:32:22.0207884Z compiled=True, 2025-05-07T20:32:22.0207958Z ) 2025-05-07T20:32:22.0208177Z self = 2025-05-07T20:32:22.0208345Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0208350Z 2025-05-07T20:32:22.0208429Z @given( 2025-05-07T20:32:22.0208547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0208644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0208764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0208883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0209049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0209126Z ) 2025-05-07T20:32:22.0209368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0209466Z def test_silu_mul_quant( 2025-05-07T20:32:22.0209543Z self, 2025-05-07T20:32:22.0209619Z T: int, 2025-05-07T20:32:22.0209701Z D: int, 2025-05-07T20:32:22.0209797Z scale_ub: Optional[float], 2025-05-07T20:32:22.0209885Z contiguous: bool, 2025-05-07T20:32:22.0209973Z compiled: bool, 2025-05-07T20:32:22.0210051Z ) -> None: 2025-05-07T20:32:22.0210143Z torch.manual_seed(2025) 2025-05-07T20:32:22.0210220Z 2025-05-07T20:32:22.0210388Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0210465Z 2025-05-07T20:32:22.0210555Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0210719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0210851Z x = x_sign * x_clamp 2025-05-07T20:32:22.0210933Z x0 = x[:, :D] 2025-05-07T20:32:22.0211010Z x1 = x[:, D:] 2025-05-07T20:32:22.0211085Z 2025-05-07T20:32:22.0211168Z if contiguous: 2025-05-07T20:32:22.0211259Z x0 = x0.contiguous() 2025-05-07T20:32:22.0211348Z x1 = x1.contiguous() 2025-05-07T20:32:22.0211418Z 2025-05-07T20:32:22.0211503Z if scale_ub is not None: 2025-05-07T20:32:22.0211609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0211742Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0211821Z ) 2025-05-07T20:32:22.0211897Z else: 2025-05-07T20:32:22.0211997Z scale_ub_tensor = None 2025-05-07T20:32:22.0212073Z 2025-05-07T20:32:22.0212202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0212289Z op = silu_mul_quant 2025-05-07T20:32:22.0212379Z if compiled: 
2025-05-07T20:32:22.0212484Z op = torch.compile(op) 2025-05-07T20:32:22.0212632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0212707Z 2025-05-07T20:32:22.0212796Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.0212916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.0212990Z 2025-05-07T20:32:22.0213124Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0213230Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.0213325Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.0213443Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.0213586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0213659Z 2025-05-07T20:32:22.0213756Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.0213761Z 2025-05-07T20:32:22.0213866Z moe/activation_test.py:126: 2025-05-07T20:32:22.0213996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0214105Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0214243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0214803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0214910Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0215272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0215494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0215866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0216120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0216570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0216831Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0217206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0217375Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0217718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0217801Z fn() 2025-05-07T20:32:22.0218206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0218288Z self.fn.run( 2025-05-07T20:32:22.0218630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0218762Z kernel = self.compile( 2025-05-07T20:32:22.0219213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0219392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0219518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0219523Z 2025-05-07T20:32:22.0219729Z self = 2025-05-07T20:32:22.0220517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:22.0221051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a37bf420>} 2025-05-07T20:32:22.0221810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0222043Z context = 2025-05-07T20:32:22.0222048Z 2025-05-07T20:32:22.0222212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0222480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0222583Z module_map=module_map) 2025-05-07T20:32:22.0222749Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0222849Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.0222929Z E ^ 2025-05-07T20:32:22.0223283Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0223289Z 2025-05-07T20:32:22.0223707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0223718Z 2025-05-07T20:32:22.0223825Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0224049Z self=, 2025-05-07T20:32:22.0224133Z T=128, 2025-05-07T20:32:22.0224209Z D=5120, 2025-05-07T20:32:22.0224292Z scale_ub=None, 2025-05-07T20:32:22.0224380Z contiguous=True, 2025-05-07T20:32:22.0224462Z compiled=True, 2025-05-07T20:32:22.0224534Z ) 2025-05-07T20:32:22.0224758Z self = 2025-05-07T20:32:22.0224924Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0224929Z 2025-05-07T20:32:22.0225006Z @given( 2025-05-07T20:32:22.0225129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0225226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0225383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0225508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0225619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0225695Z ) 2025-05-07T20:32:22.0225937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0226029Z def test_silu_mul_quant( 2025-05-07T20:32:22.0226105Z self, 2025-05-07T20:32:22.0226180Z T: int, 2025-05-07T20:32:22.0226259Z D: int, 2025-05-07T20:32:22.0226358Z scale_ub: Optional[float], 2025-05-07T20:32:22.0226446Z contiguous: bool, 2025-05-07T20:32:22.0226528Z compiled: bool, 2025-05-07T20:32:22.0226606Z ) -> None: 2025-05-07T20:32:22.0226699Z torch.manual_seed(2025) 2025-05-07T20:32:22.0226771Z 2025-05-07T20:32:22.0226943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0227063Z 2025-05-07T20:32:22.0227167Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0227330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0227416Z x = x_sign * x_clamp 2025-05-07T20:32:22.0227502Z x0 = x[:, :D] 2025-05-07T20:32:22.0227578Z x1 = x[:, D:] 2025-05-07T20:32:22.0227652Z 2025-05-07T20:32:22.0227737Z if contiguous: 2025-05-07T20:32:22.0227826Z x0 = x0.contiguous() 2025-05-07T20:32:22.0227914Z x1 = x1.contiguous() 2025-05-07T20:32:22.0227991Z 2025-05-07T20:32:22.0228077Z if scale_ub is not None: 2025-05-07T20:32:22.0228181Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0228319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0228395Z ) 2025-05-07T20:32:22.0228475Z else: 2025-05-07T20:32:22.0228566Z scale_ub_tensor = None 2025-05-07T20:32:22.0228637Z 2025-05-07T20:32:22.0228772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:22.0228863Z op = silu_mul_quant 2025-05-07T20:32:22.0228989Z if compiled: 2025-05-07T20:32:22.0229094Z op = torch.compile(op) 2025-05-07T20:32:22.0229196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0229266Z 2025-05-07T20:32:22.0229358Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.0229478Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.0229549Z 2025-05-07T20:32:22.0229687Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0229784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.0229887Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.0230007Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.0230147Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0245871Z 2025-05-07T20:32:22.0246020Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.0246027Z 2025-05-07T20:32:22.0246138Z moe/activation_test.py:126: 2025-05-07T20:32:22.0246285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0246391Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.0246532Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.0247108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.0247214Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.0247575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0247806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0248180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.0248558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0248968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.0249224Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.0249610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.0249779Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.0250128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.0250205Z fn() 2025-05-07T20:32:22.0250609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.0250697Z self.fn.run( 2025-05-07T20:32:22.0251099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0251248Z kernel = self.compile( 2025-05-07T20:32:22.0251641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0251816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0251952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0251957Z 2025-05-07T20:32:22.0252164Z self = 2025-05-07T20:32:22.0252940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
2025-05-07T20:32:22.0256257Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.0262930Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0263042Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0271856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0271959Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0272043Z E       ^
2025-05-07T20:32:22.0272403Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0272934Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.0279473Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0279575Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0288423Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0288530Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0288604Z E       ^
2025-05-07T20:32:22.0288958Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0289491Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:22.0294974Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0295078Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0301461Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0301557Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0301633Z E       ^
2025-05-07T20:32:22.0301996Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
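Both failing paths, the fused silu_mul_quant op and the ref_fn reference, return a (y_fp8, y_scale) pair that the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A sketch of the row-wise contract that implies, assuming scale = row_max / FP8_MAX with row_max optionally capped by scale_ub; this is an inference from the test body, not FBGEMM's triton_quantize_fp8_row implementation, and the zero-guard epsilon is illustrative:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub=None):
    # Per-row dequantization scale: largest magnitude in the row over FP8_MAX.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:  # scale_ub arrives as a 1-element fp32 tensor
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_MAX).clamp(min=1e-12)  # illustrative zero guard
    q = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale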
2025-05-07T20:32:22.0302530Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.0308976Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.0309077Z moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]
2025-05-07T20:32:22.0317884Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0317980Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.0318062Z E       ^
2025-05-07T20:32:22.0318416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0318987Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:22.0324531Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0324632Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0330445Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0330583Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0330665Z E       ^
2025-05-07T20:32:22.0331023Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0331542Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.0336983Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0337085Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0343877Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0343974Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0344053Z E       ^
2025-05-07T20:32:22.0344409Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.0344950Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.0350523Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.0350636Z moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:32:22.0356449Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.0356548Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.0356665Z E       ^
2025-05-07T20:32:22.0357021Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
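Note that every CompilationError is reported at 1:0, the kernel's def line: make_ir aborts while lowering the AST, before any launch, so the kernel bodies and autotuner configs never come into play. A hypothetical standalone repro sketch under the same assumption (a CUDA device below SM 8.9); tl.float8e4nv is Triton's spelling of the dtype named in the message, and the kernel here is illustrative, not an FBGEMM kernel:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8(x_ptr, y_ptr):
    # The cast below is what trips Triton's fp8e4nv architecture check.
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.ones(1, device="cuda")
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8[(1,)](x, y)  # expected: triton.compiler.errors.CompilationError on SM < 8.9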
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0357073Z 2025-05-07T20:32:22.0357485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0357489Z 2025-05-07T20:32:22.0357591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0357812Z self=, 2025-05-07T20:32:22.0357888Z T=128, 2025-05-07T20:32:22.0357962Z D=5120, 2025-05-07T20:32:22.0358052Z scale_ub=None, 2025-05-07T20:32:22.0358137Z contiguous=False, 2025-05-07T20:32:22.0358222Z compiled=False, 2025-05-07T20:32:22.0358302Z ) 2025-05-07T20:32:22.0358516Z self = 2025-05-07T20:32:22.0358689Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0358696Z 2025-05-07T20:32:22.0358771Z @given( 2025-05-07T20:32:22.0358894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0359042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0359154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0359267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0359386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0359457Z ) 2025-05-07T20:32:22.0359697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0359789Z def test_silu_mul_quant( 2025-05-07T20:32:22.0359863Z self, 2025-05-07T20:32:22.0359944Z T: int, 2025-05-07T20:32:22.0360021Z D: int, 2025-05-07T20:32:22.0360116Z scale_ub: Optional[float], 2025-05-07T20:32:22.0360207Z contiguous: bool, 2025-05-07T20:32:22.0360288Z compiled: bool, 2025-05-07T20:32:22.0360362Z ) -> None: 2025-05-07T20:32:22.0360461Z torch.manual_seed(2025) 2025-05-07T20:32:22.0360538Z 2025-05-07T20:32:22.0360712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0360789Z 2025-05-07T20:32:22.0360880Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0361007Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0361097Z x = x_sign * x_clamp 2025-05-07T20:32:22.0361171Z x0 = x[:, :D] 2025-05-07T20:32:22.0361253Z x1 = x[:, D:] 2025-05-07T20:32:22.0361326Z 2025-05-07T20:32:22.0361407Z if contiguous: 2025-05-07T20:32:22.0361506Z x0 = x0.contiguous() 2025-05-07T20:32:22.0361595Z x1 = x1.contiguous() 2025-05-07T20:32:22.0361665Z 2025-05-07T20:32:22.0361758Z if scale_ub is not None: 2025-05-07T20:32:22.0361861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0361991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0362073Z ) 2025-05-07T20:32:22.0362147Z else: 2025-05-07T20:32:22.0362283Z scale_ub_tensor = None 2025-05-07T20:32:22.0362365Z 2025-05-07T20:32:22.0362490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0362579Z op = silu_mul_quant 2025-05-07T20:32:22.0362664Z if compiled: 2025-05-07T20:32:22.0362762Z op = torch.compile(op) 2025-05-07T20:32:22.0362867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0362936Z 2025-05-07T20:32:22.0363027Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0363031Z 2025-05-07T20:32:22.0363132Z moe/activation_test.py:117: 2025-05-07T20:32:22.0363260Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0363483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0363584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0364151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0364291Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0364652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0364874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0365221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0365313Z kernel = self.compile( 2025-05-07T20:32:22.0365698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0365880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0366005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0366009Z 2025-05-07T20:32:22.0366216Z self = 2025-05-07T20:32:22.0366990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0367528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a212c720>} 2025-05-07T20:32:22.0368283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0368471Z context = 2025-05-07T20:32:22.0368476Z 2025-05-07T20:32:22.0368644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0368904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0369018Z module_map=module_map) 2025-05-07T20:32:22.0369178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0369272Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0369353Z E ^ 2025-05-07T20:32:22.0369704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0369708Z 2025-05-07T20:32:22.0370120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0370125Z 2025-05-07T20:32:22.0370232Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0370450Z self=, 2025-05-07T20:32:22.0370527Z T=128, 2025-05-07T20:32:22.0370600Z D=5120, 2025-05-07T20:32:22.0370683Z scale_ub=1200.0, 2025-05-07T20:32:22.0375286Z contiguous=True, 2025-05-07T20:32:22.0375465Z compiled=False, 2025-05-07T20:32:22.0375555Z ) 2025-05-07T20:32:22.0375781Z self = 2025-05-07T20:32:22.0375957Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0375962Z 2025-05-07T20:32:22.0376041Z @given( 2025-05-07T20:32:22.0376163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0376271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0376381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0376498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0376616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0376690Z ) 2025-05-07T20:32:22.0376935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0377039Z def test_silu_mul_quant( 2025-05-07T20:32:22.0377159Z self, 2025-05-07T20:32:22.0377246Z T: int, 2025-05-07T20:32:22.0377366Z D: int, 2025-05-07T20:32:22.0377461Z scale_ub: Optional[float], 2025-05-07T20:32:22.0377553Z contiguous: bool, 2025-05-07T20:32:22.0377638Z compiled: bool, 2025-05-07T20:32:22.0377715Z ) -> None: 2025-05-07T20:32:22.0377814Z torch.manual_seed(2025) 2025-05-07T20:32:22.0377887Z 2025-05-07T20:32:22.0378057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0378132Z 2025-05-07T20:32:22.0378224Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0378351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0378444Z x = x_sign * x_clamp 2025-05-07T20:32:22.0378521Z x0 = x[:, :D] 2025-05-07T20:32:22.0378603Z x1 = x[:, D:] 2025-05-07T20:32:22.0378672Z 2025-05-07T20:32:22.0378755Z if contiguous: 2025-05-07T20:32:22.0378852Z x0 = x0.contiguous() 2025-05-07T20:32:22.0378941Z x1 = x1.contiguous() 2025-05-07T20:32:22.0379018Z 2025-05-07T20:32:22.0379162Z if scale_ub is not None: 2025-05-07T20:32:22.0379265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0379398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0379479Z ) 2025-05-07T20:32:22.0379554Z else: 2025-05-07T20:32:22.0379645Z scale_ub_tensor = None 2025-05-07T20:32:22.0379721Z 2025-05-07T20:32:22.0379849Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0379936Z op = silu_mul_quant 2025-05-07T20:32:22.0380023Z if compiled: 2025-05-07T20:32:22.0380123Z op = torch.compile(op) 2025-05-07T20:32:22.0380231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0380305Z 2025-05-07T20:32:22.0380396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0380402Z 2025-05-07T20:32:22.0380523Z moe/activation_test.py:117: 2025-05-07T20:32:22.0380680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0380787Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0380888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0381391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0381493Z 
[traceback continued from the previous Hypothesis example]
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
[... make_ir frame and locals identical to the failure above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
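Note on this failure: "fp8e4nv" is Triton's name for the NVIDIA float8 e4m3
format (torch.float8_e4m3fn). Triton's CUDA backend only compiles that dtype
on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on a
g5.4xlarge, whose A10G GPU is sm_86, where only fp8e4b15 and fp8e5 are
available, so any kernel that materializes the dtype dies in ast_to_ttir
before launch. Below is a minimal sketch of a capability gate such a test
could use; the helper name, the class name, and the skip placement are
illustrative assumptions, not code from activation_test.py:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (float8 e4m3) only for compute capability
        # >= (8, 9); the capability can be queried through PyTorch.
        # (Hypothetical helper, shown for illustration.)
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as shown above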
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
Same test source; fails at the same line (moe/activation_test.py:117) with the
identical CompilationError: ValueError("type fp8e4nv not supported in this
architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

Same test source; this example gets past fn() and fails one step later, in the
reference path, while Triton compiles FBGEMM's row-quantization kernel:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
[... make_ir frame as above; options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ...) ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
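Note: the reference path hits the same wall because triton_quantize_fp8_row
itself compiles a Triton kernel (_kernel_quantize_fp8_row). The quantity under
test is y = x0 * sigmoid(x0) * x1, quantized row-wise to fp8. The following is
a sketch of a pure-eager reference that avoids Triton entirely and so still
runs on sm_86; quantize_fp8_row_eager, FP8_MAX, and the clamping details are
illustrative assumptions (this is not FBGEMM's implementation), and it assumes
the installed torch provides torch.float8_e4m3fn:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum determines the dequantization scale.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Saturate before casting: e4m3fn has no inf to absorb overflow.
        y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does, y_fp8.to(torch.float32) * scale[:, None]
recovers y up to fp8 rounding error.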
2025-05-07T20:32:22.0435462Z op = torch.compile(op) 2025-05-07T20:32:22.0435570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0435635Z 2025-05-07T20:32:22.0435727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0435732Z 2025-05-07T20:32:22.0435827Z moe/activation_test.py:117: 2025-05-07T20:32:22.0435953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0436053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0436148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0436514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0436608Z return fn(*args, **kwargs) 2025-05-07T20:32:22.0437106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0437211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0437567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0437786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0438127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0438219Z kernel = self.compile( 2025-05-07T20:32:22.0438884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0439070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0439194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0439199Z 2025-05-07T20:32:22.0439411Z self = 2025-05-07T20:32:22.0440272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0440775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fcab60>} 2025-05-07T20:32:22.0441575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0441764Z context = 2025-05-07T20:32:22.0441769Z 2025-05-07T20:32:22.0441934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0442251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0442412Z module_map=module_map) 2025-05-07T20:32:22.0442573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0442667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0442745Z E ^ 2025-05-07T20:32:22.0443099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0443104Z 2025-05-07T20:32:22.0443637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0443642Z 2025-05-07T20:32:22.0443743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0443965Z self=, 2025-05-07T20:32:22.0444044Z T=1, 2025-05-07T20:32:22.0444118Z D=5120, 2025-05-07T20:32:22.0444207Z scale_ub=1200.0, 2025-05-07T20:32:22.0444300Z contiguous=False, 2025-05-07T20:32:22.0444468Z compiled=False, 2025-05-07T20:32:22.0444541Z ) 2025-05-07T20:32:22.0444760Z self = 2025-05-07T20:32:22.0444924Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0444929Z 2025-05-07T20:32:22.0445005Z @given( 2025-05-07T20:32:22.0445125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0445219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0445333Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0445444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0445553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0445628Z ) 2025-05-07T20:32:22.0445869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0445961Z def test_silu_mul_quant( 2025-05-07T20:32:22.0446036Z self, 2025-05-07T20:32:22.0446114Z T: int, 2025-05-07T20:32:22.0446192Z D: int, 2025-05-07T20:32:22.0446291Z scale_ub: Optional[float], 2025-05-07T20:32:22.0446377Z contiguous: bool, 2025-05-07T20:32:22.0446461Z compiled: bool, 2025-05-07T20:32:22.0446539Z ) -> None: 2025-05-07T20:32:22.0446631Z torch.manual_seed(2025) 2025-05-07T20:32:22.0446707Z 2025-05-07T20:32:22.0446872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0446942Z 2025-05-07T20:32:22.0447034Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0447153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0447239Z x = x_sign * x_clamp 2025-05-07T20:32:22.0447319Z x0 = x[:, :D] 2025-05-07T20:32:22.0447394Z x1 = x[:, D:] 2025-05-07T20:32:22.0447463Z 2025-05-07T20:32:22.0447548Z if contiguous: 2025-05-07T20:32:22.0447641Z x0 = x0.contiguous() 2025-05-07T20:32:22.0447770Z x1 = x1.contiguous() 2025-05-07T20:32:22.0447850Z 2025-05-07T20:32:22.0447937Z if scale_ub is not None: 2025-05-07T20:32:22.0448043Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0448174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0448250Z ) 2025-05-07T20:32:22.0448328Z else: 2025-05-07T20:32:22.0448420Z scale_ub_tensor = None 2025-05-07T20:32:22.0448491Z 2025-05-07T20:32:22.0448618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0448706Z op = silu_mul_quant 2025-05-07T20:32:22.0448786Z if compiled: 2025-05-07T20:32:22.0448883Z op = torch.compile(op) 2025-05-07T20:32:22.0448984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0449053Z 2025-05-07T20:32:22.0449141Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0449146Z 2025-05-07T20:32:22.0449283Z moe/activation_test.py:117: 2025-05-07T20:32:22.0449417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0449556Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0449653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0450153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0450249Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0450607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0450831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0451219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0451313Z kernel = self.compile( 2025-05-07T20:32:22.0451699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0451916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0452043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0452047Z 2025-05-07T20:32:22.0452248Z self = 2025-05-07T20:32:22.0453015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0453509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1fcb2e0>} 2025-05-07T20:32:22.0454261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0454455Z context = 2025-05-07T20:32:22.0454459Z 2025-05-07T20:32:22.0454623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0454887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0454989Z module_map=module_map) 2025-05-07T20:32:22.0455147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0455247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0455323Z E ^ 2025-05-07T20:32:22.0455674Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0455684Z 2025-05-07T20:32:22.0456100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0456107Z 2025-05-07T20:32:22.0456255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0456481Z self=, 2025-05-07T20:32:22.0456556Z T=16384, 2025-05-07T20:32:22.0456632Z D=5120, 2025-05-07T20:32:22.0456721Z scale_ub=1200.0, 2025-05-07T20:32:22.0456803Z contiguous=False, 2025-05-07T20:32:22.0456883Z compiled=True, 2025-05-07T20:32:22.0456957Z ) 2025-05-07T20:32:22.0457172Z self = 2025-05-07T20:32:22.0457349Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0457353Z 2025-05-07T20:32:22.0457427Z @given( 2025-05-07T20:32:22.0457547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0457646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0457756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0457914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0458101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0458175Z ) 2025-05-07T20:32:22.0458421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0458510Z def test_silu_mul_quant( 2025-05-07T20:32:22.0458584Z self, 2025-05-07T20:32:22.0458665Z T: int, 2025-05-07T20:32:22.0458739Z D: int, 2025-05-07T20:32:22.0458834Z scale_ub: Optional[float], 2025-05-07T20:32:22.0458923Z contiguous: bool, 2025-05-07T20:32:22.0459008Z compiled: bool, 2025-05-07T20:32:22.0459083Z ) -> None: 2025-05-07T20:32:22.0459182Z torch.manual_seed(2025) 2025-05-07T20:32:22.0459252Z 2025-05-07T20:32:22.0459420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0459493Z 2025-05-07T20:32:22.0459582Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0459707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0459804Z x = x_sign * x_clamp 2025-05-07T20:32:22.0459929Z x0 = x[:, :D] 2025-05-07T20:32:22.0460013Z x1 = x[:, D:] 2025-05-07T20:32:22.0460086Z 2025-05-07T20:32:22.0460168Z if contiguous: 2025-05-07T20:32:22.0460261Z x0 = x0.contiguous() 2025-05-07T20:32:22.0460346Z x1 = x1.contiguous() 2025-05-07T20:32:22.0460416Z 2025-05-07T20:32:22.0460507Z if scale_ub is not None: 2025-05-07T20:32:22.0460608Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0460739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0460817Z ) 2025-05-07T20:32:22.0460888Z else: 2025-05-07T20:32:22.0460980Z scale_ub_tensor = None 2025-05-07T20:32:22.0461052Z 2025-05-07T20:32:22.0461177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0461267Z op = silu_mul_quant 2025-05-07T20:32:22.0461354Z if compiled: 2025-05-07T20:32:22.0461456Z op = torch.compile(op) 2025-05-07T20:32:22.0461565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0461636Z 2025-05-07T20:32:22.0461723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0461728Z 2025-05-07T20:32:22.0461828Z moe/activation_test.py:117: 2025-05-07T20:32:22.0461954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0462052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0462151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0462519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0462616Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0463112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0463210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0463615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0463842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0465510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0465604Z kernel = self.compile( 2025-05-07T20:32:22.0465985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0466163Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0466286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0466291Z 2025-05-07T20:32:22.0466499Z self = 2025-05-07T20:32:22.0467316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0467851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b40fe0>} 2025-05-07T20:32:22.0468605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0468793Z context = 2025-05-07T20:32:22.0468798Z 2025-05-07T20:32:22.0468966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0469232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0469338Z module_map=module_map) 2025-05-07T20:32:22.0469546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0469643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0469717Z E ^ 2025-05-07T20:32:22.0470076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0470080Z 2025-05-07T20:32:22.0470495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0470499Z 2025-05-07T20:32:22.0470602Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0470819Z self=, 2025-05-07T20:32:22.0470890Z T=2048, 2025-05-07T20:32:22.0470968Z D=7168, 2025-05-07T20:32:22.0471046Z scale_ub=1200.0, 2025-05-07T20:32:22.0471131Z contiguous=False, 2025-05-07T20:32:22.0471219Z compiled=True, 2025-05-07T20:32:22.0471293Z ) 2025-05-07T20:32:22.0471510Z self = 2025-05-07T20:32:22.0471690Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0471695Z 2025-05-07T20:32:22.0471768Z @given( 2025-05-07T20:32:22.0471887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0471980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0472091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0472210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0472320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0472392Z ) 2025-05-07T20:32:22.0472636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0472729Z def test_silu_mul_quant( 2025-05-07T20:32:22.0472806Z self, 2025-05-07T20:32:22.0472881Z T: int, 2025-05-07T20:32:22.0472958Z D: int, 2025-05-07T20:32:22.0473101Z scale_ub: Optional[float], 2025-05-07T20:32:22.0473195Z contiguous: bool, 2025-05-07T20:32:22.0473276Z compiled: bool, 2025-05-07T20:32:22.0473357Z ) -> None: 2025-05-07T20:32:22.0473448Z torch.manual_seed(2025) 2025-05-07T20:32:22.0473520Z 2025-05-07T20:32:22.0473688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0473757Z 2025-05-07T20:32:22.0473850Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0473975Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0474063Z x = x_sign * x_clamp 2025-05-07T20:32:22.0474140Z x0 = x[:, :D] 2025-05-07T20:32:22.0474221Z x1 = x[:, D:] 2025-05-07T20:32:22.0474291Z 2025-05-07T20:32:22.0474376Z if contiguous: 2025-05-07T20:32:22.0474465Z x0 = x0.contiguous() 2025-05-07T20:32:22.0474548Z x1 = x1.contiguous() 2025-05-07T20:32:22.0474623Z 2025-05-07T20:32:22.0476136Z if scale_ub is not None: 2025-05-07T20:32:22.0476280Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0476418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0476489Z ) 2025-05-07T20:32:22.0476563Z else: 2025-05-07T20:32:22.0476656Z scale_ub_tensor = None 2025-05-07T20:32:22.0476725Z 2025-05-07T20:32:22.0476850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0476941Z op = silu_mul_quant 2025-05-07T20:32:22.0477021Z if compiled: 2025-05-07T20:32:22.0477122Z op = torch.compile(op) 2025-05-07T20:32:22.0477223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0477292Z 2025-05-07T20:32:22.0477383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0477388Z 2025-05-07T20:32:22.0477484Z moe/activation_test.py:117: 2025-05-07T20:32:22.0477612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0477712Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0477857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0478224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0478317Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0478811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0478906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0479264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0479484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0479826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0479917Z kernel = self.compile( 2025-05-07T20:32:22.0480310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0480486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0480612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0480616Z 2025-05-07T20:32:22.0480823Z self = 2025-05-07T20:32:22.0481642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0482139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b41b20>} 2025-05-07T20:32:22.0482930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0483125Z context = 2025-05-07T20:32:22.0483129Z 2025-05-07T20:32:22.0483382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0483645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0483752Z module_map=module_map) 2025-05-07T20:32:22.0483909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0484004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0484081Z E ^ 2025-05-07T20:32:22.0484433Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0484438Z 2025-05-07T20:32:22.0484899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0484943Z 2025-05-07T20:32:22.0485044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0485264Z self=, 2025-05-07T20:32:22.0485345Z T=1, 2025-05-07T20:32:22.0485421Z D=5120, 2025-05-07T20:32:22.0485501Z scale_ub=None, 2025-05-07T20:32:22.0485589Z contiguous=False, 2025-05-07T20:32:22.0485672Z compiled=False, 2025-05-07T20:32:22.0485743Z ) 2025-05-07T20:32:22.0485962Z self = 2025-05-07T20:32:22.0486124Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0486128Z 2025-05-07T20:32:22.0486209Z @given( 2025-05-07T20:32:22.0486325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0486420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0486533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0486650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0486803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0486880Z ) 2025-05-07T20:32:22.0487121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0487213Z def test_silu_mul_quant( 2025-05-07T20:32:22.0487290Z self, 2025-05-07T20:32:22.0487364Z T: int, 2025-05-07T20:32:22.0487441Z D: int, 2025-05-07T20:32:22.0487536Z scale_ub: Optional[float], 2025-05-07T20:32:22.0487623Z contiguous: bool, 2025-05-07T20:32:22.0487709Z compiled: bool, 2025-05-07T20:32:22.0487786Z ) -> None: 2025-05-07T20:32:22.0487878Z torch.manual_seed(2025) 2025-05-07T20:32:22.0487952Z 2025-05-07T20:32:22.0488120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0488190Z 2025-05-07T20:32:22.0488281Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0488407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0488496Z x = x_sign * x_clamp 2025-05-07T20:32:22.0488576Z x0 = x[:, :D] 2025-05-07T20:32:22.0488655Z x1 = x[:, D:] 2025-05-07T20:32:22.0488725Z 2025-05-07T20:32:22.0488807Z if contiguous: 2025-05-07T20:32:22.0488895Z x0 = x0.contiguous() 2025-05-07T20:32:22.0488982Z x1 = x1.contiguous() 2025-05-07T20:32:22.0489049Z 2025-05-07T20:32:22.0489135Z if scale_ub is not None: 2025-05-07T20:32:22.0489242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0489371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0489443Z ) 2025-05-07T20:32:22.0489522Z else: 2025-05-07T20:32:22.0489613Z scale_ub_tensor = None 2025-05-07T20:32:22.0489684Z 2025-05-07T20:32:22.0489817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0489904Z op = silu_mul_quant 2025-05-07T20:32:22.0490037Z if compiled: 2025-05-07T20:32:22.0490142Z op = torch.compile(op) 2025-05-07T20:32:22.0490245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0490317Z 2025-05-07T20:32:22.0490407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0490411Z 2025-05-07T20:32:22.0490506Z moe/activation_test.py:117: 2025-05-07T20:32:22.0490638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0490735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0490831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0491331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0491424Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0491848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0492109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0492455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0492550Z kernel = self.compile( 2025-05-07T20:32:22.0492931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0493102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0493228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0493233Z 2025-05-07T20:32:22.0493435Z self = 2025-05-07T20:32:22.0494210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0494751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1b42e80>} 2025-05-07T20:32:22.0495504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0495689Z context = 2025-05-07T20:32:22.0495693Z 2025-05-07T20:32:22.0495856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0496125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0496227Z module_map=module_map) 2025-05-07T20:32:22.0496387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0496487Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0496567Z E ^ 2025-05-07T20:32:22.0496924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0496929Z 2025-05-07T20:32:22.0497338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0497343Z 2025-05-07T20:32:22.0497443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0505862Z self=, 2025-05-07T20:32:22.0505977Z T=4096, 2025-05-07T20:32:22.0506064Z D=7168, 2025-05-07T20:32:22.0506185Z scale_ub=1200.0, 2025-05-07T20:32:22.0506281Z contiguous=False, 2025-05-07T20:32:22.0506374Z compiled=False, 2025-05-07T20:32:22.0506459Z ) 2025-05-07T20:32:22.0506727Z self = 2025-05-07T20:32:22.0507001Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0507013Z 2025-05-07T20:32:22.0507101Z @given( 2025-05-07T20:32:22.0507227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0507333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0507458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0507579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0507699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0507787Z ) 2025-05-07T20:32:22.0508044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0508143Z def test_silu_mul_quant( 2025-05-07T20:32:22.0508229Z self, 2025-05-07T20:32:22.0508311Z T: int, 2025-05-07T20:32:22.0508401Z D: int, 2025-05-07T20:32:22.0508506Z scale_ub: Optional[float], 2025-05-07T20:32:22.0508647Z contiguous: bool, 2025-05-07T20:32:22.0508747Z compiled: bool, 2025-05-07T20:32:22.0508877Z ) -> None: 2025-05-07T20:32:22.0508980Z torch.manual_seed(2025) 2025-05-07T20:32:22.0509067Z 2025-05-07T20:32:22.0509241Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0509350Z 2025-05-07T20:32:22.0509458Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0509613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0509706Z x = x_sign * x_clamp 2025-05-07T20:32:22.0509794Z x0 = x[:, :D] 2025-05-07T20:32:22.0509879Z x1 = x[:, D:] 2025-05-07T20:32:22.0509959Z 2025-05-07T20:32:22.0510056Z if contiguous: 2025-05-07T20:32:22.0510153Z x0 = x0.contiguous() 2025-05-07T20:32:22.0510252Z x1 = x1.contiguous() 2025-05-07T20:32:22.0510331Z 2025-05-07T20:32:22.0510424Z if scale_ub is not None: 2025-05-07T20:32:22.0510562Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0510723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0510858Z ) 2025-05-07T20:32:22.0510946Z else: 2025-05-07T20:32:22.0511071Z scale_ub_tensor = None 2025-05-07T20:32:22.0511147Z 2025-05-07T20:32:22.0511285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0511379Z op = silu_mul_quant 2025-05-07T20:32:22.0511467Z if compiled: 2025-05-07T20:32:22.0511577Z op = torch.compile(op) 2025-05-07T20:32:22.0511684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0511764Z 2025-05-07T20:32:22.0511857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0511861Z 2025-05-07T20:32:22.0511961Z moe/activation_test.py:117: 2025-05-07T20:32:22.0512098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0512203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0512311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0512822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.0512928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0513300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0513531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0513878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0513980Z kernel = self.compile( 2025-05-07T20:32:22.0514370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0514549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0514683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0514688Z 2025-05-07T20:32:22.0514946Z self = 2025-05-07T20:32:22.0515736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0516236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a329c040>} 2025-05-07T20:32:22.0516993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0517186Z context = 2025-05-07T20:32:22.0517191Z 2025-05-07T20:32:22.0517401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0517719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0517831Z module_map=module_map) 2025-05-07T20:32:22.0517998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0518100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0518182Z E ^ 2025-05-07T20:32:22.0518543Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0518548Z 2025-05-07T20:32:22.0518968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0518972Z 2025-05-07T20:32:22.0519079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0519313Z self=, 2025-05-07T20:32:22.0519398Z T=16384, 2025-05-07T20:32:22.0519483Z D=7168, 2025-05-07T20:32:22.0519618Z scale_ub=None, 2025-05-07T20:32:22.0519706Z contiguous=True, 2025-05-07T20:32:22.0519794Z compiled=True, 2025-05-07T20:32:22.0519868Z ) 2025-05-07T20:32:22.0520090Z self = 2025-05-07T20:32:22.0520273Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0520278Z 2025-05-07T20:32:22.0520357Z @given( 2025-05-07T20:32:22.0520481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0520589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0520708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0520837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0520954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0521034Z ) 2025-05-07T20:32:22.0521290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0521390Z def test_silu_mul_quant( 2025-05-07T20:32:22.0521474Z self, 2025-05-07T20:32:22.0521561Z T: int, 2025-05-07T20:32:22.0521642Z D: int, 2025-05-07T20:32:22.0521743Z scale_ub: Optional[float], 2025-05-07T20:32:22.0542593Z contiguous: bool, 2025-05-07T20:32:22.0542710Z compiled: bool, 2025-05-07T20:32:22.0542786Z ) -> None: 2025-05-07T20:32:22.0542883Z torch.manual_seed(2025) 2025-05-07T20:32:22.0542952Z 2025-05-07T20:32:22.0543131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0543200Z 2025-05-07T20:32:22.0543289Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0543412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0543497Z x = x_sign * x_clamp 2025-05-07T20:32:22.0543573Z x0 = x[:, :D] 2025-05-07T20:32:22.0543651Z x1 = x[:, D:] 2025-05-07T20:32:22.0543720Z 2025-05-07T20:32:22.0543810Z if contiguous: 2025-05-07T20:32:22.0544092Z x0 = x0.contiguous() 2025-05-07T20:32:22.0544183Z x1 = x1.contiguous() 2025-05-07T20:32:22.0544251Z 2025-05-07T20:32:22.0544344Z if scale_ub is not None: 2025-05-07T20:32:22.0544447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0544584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0544656Z ) 2025-05-07T20:32:22.0544728Z else: 2025-05-07T20:32:22.0544823Z scale_ub_tensor = None 2025-05-07T20:32:22.0544891Z 2025-05-07T20:32:22.0545020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0545108Z op = silu_mul_quant 2025-05-07T20:32:22.0545229Z if compiled: 2025-05-07T20:32:22.0545325Z op = torch.compile(op) 2025-05-07T20:32:22.0545430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0545501Z 2025-05-07T20:32:22.0545661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0545667Z 2025-05-07T20:32:22.0545835Z moe/activation_test.py:117: 2025-05-07T20:32:22.0545963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0546066Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0546166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0546533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0546630Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0547121Z test_silu_mul_quant (moe/activation_test.py) failed on every Hypothesis example with the same Triton CompilationError. Representative example and traceback below; object reprs were dropped in log capture and appear as <...>.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
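For context on what fails to compile: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel, which fuses SiLU(x0) * x1 with quantization to FP8 and also returns a scale. A rough eager-mode sketch of that computation is below; the helper name, the rowwise absmax scaling, and the e4m3 target are this note's assumptions for illustration, not FBGEMM's exact implementation.

    import torch

    FP8_DTYPE = torch.float8_e4m3fn       # what Triton calls "fp8e4nv"
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 before quantizing.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absmax, optionally capped by scale_ub, maps each row
        # onto the representable FP8 range.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale).to(FP8_DTYPE)
        return y_fp8, scale.squeeze(-1)

On hardware where the Triton kernel compiles, a reference of this kind is the sort of oracle such a test can compare against; here compilation fails before any numerics run.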
Hypothesis then retried the remaining examples, and every one raised the identical CompilationError from triton/compiler/compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with the same traceback as above:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)

The captured log ends mid-traceback while retrying the last of these examples.
2025-05-07T20:32:22.0707717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0707820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0708177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0708398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0708745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0708835Z kernel = self.compile( 2025-05-07T20:32:22.0709219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0709395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0709518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0709524Z 2025-05-07T20:32:22.0709776Z self = 2025-05-07T20:32:22.0710547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0711045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2785e40>} 2025-05-07T20:32:22.0711795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0711982Z context = 2025-05-07T20:32:22.0711986Z 2025-05-07T20:32:22.0712213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0712512Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0712617Z module_map=module_map) 2025-05-07T20:32:22.0712775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0712869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0712948Z E ^ 2025-05-07T20:32:22.0713298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0713303Z 2025-05-07T20:32:22.0713715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0713725Z 2025-05-07T20:32:22.0713824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0714042Z self=, 2025-05-07T20:32:22.0714118Z T=2048, 2025-05-07T20:32:22.0714192Z D=5120, 2025-05-07T20:32:22.0714271Z scale_ub=None, 2025-05-07T20:32:22.0714402Z contiguous=False, 2025-05-07T20:32:22.0714481Z compiled=True, 2025-05-07T20:32:22.0714550Z ) 2025-05-07T20:32:22.0714769Z self = 2025-05-07T20:32:22.0714940Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.0714945Z 2025-05-07T20:32:22.0715025Z @given( 2025-05-07T20:32:22.0715139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0715231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0715343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0715453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0715562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0715636Z ) 2025-05-07T20:32:22.0715876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0715967Z def test_silu_mul_quant( 2025-05-07T20:32:22.0716047Z self, 2025-05-07T20:32:22.0716122Z T: int, 2025-05-07T20:32:22.0716192Z D: int, 2025-05-07T20:32:22.0716288Z scale_ub: Optional[float], 2025-05-07T20:32:22.0716371Z contiguous: bool, 2025-05-07T20:32:22.0716455Z compiled: bool, 2025-05-07T20:32:22.0716529Z ) -> None: 2025-05-07T20:32:22.0716620Z torch.manual_seed(2025) 2025-05-07T20:32:22.0716693Z 2025-05-07T20:32:22.0716858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0716929Z 2025-05-07T20:32:22.0717022Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0717139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0717223Z x = x_sign * x_clamp 2025-05-07T20:32:22.0717303Z x0 = x[:, :D] 2025-05-07T20:32:22.0717378Z x1 = x[:, D:] 2025-05-07T20:32:22.0717448Z 2025-05-07T20:32:22.0717533Z if contiguous: 2025-05-07T20:32:22.0717619Z x0 = x0.contiguous() 2025-05-07T20:32:22.0717759Z x1 = x1.contiguous() 2025-05-07T20:32:22.0717830Z 2025-05-07T20:32:22.0717915Z if scale_ub is not None: 2025-05-07T20:32:22.0718018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0718149Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0718220Z ) 2025-05-07T20:32:22.0718297Z else: 2025-05-07T20:32:22.0718386Z scale_ub_tensor = None 2025-05-07T20:32:22.0718454Z 2025-05-07T20:32:22.0718581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0718665Z op = silu_mul_quant 2025-05-07T20:32:22.0718744Z if compiled: 2025-05-07T20:32:22.0718844Z op = torch.compile(op) 2025-05-07T20:32:22.0718944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0719010Z 2025-05-07T20:32:22.0719100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0719146Z 2025-05-07T20:32:22.0719239Z moe/activation_test.py:117: 2025-05-07T20:32:22.0719410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0719506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0719598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0719967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0720053Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0720563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0720672Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0721053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0721274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0721615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0721746Z kernel = self.compile( 2025-05-07T20:32:22.0722132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0722301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0722428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0722432Z 2025-05-07T20:32:22.0722633Z self = 2025-05-07T20:32:22.0723508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0724006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f18a2787240>} 2025-05-07T20:32:22.0724758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0724948Z context = 2025-05-07T20:32:22.0724952Z 2025-05-07T20:32:22.0725112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0725373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0725477Z module_map=module_map) 2025-05-07T20:32:22.0725634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0725730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0725802Z E ^ 2025-05-07T20:32:22.0726198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0726207Z 2025-05-07T20:32:22.0726622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0726626Z 2025-05-07T20:32:22.0726724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0726948Z self=, 2025-05-07T20:32:22.0727019Z T=2048, 2025-05-07T20:32:22.0727091Z D=5120, 2025-05-07T20:32:22.0727170Z scale_ub=1200.0, 2025-05-07T20:32:22.0727249Z contiguous=False, 2025-05-07T20:32:22.0727329Z compiled=True, 2025-05-07T20:32:22.0727403Z ) 2025-05-07T20:32:22.0727616Z self = 2025-05-07T20:32:22.0727787Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0727791Z 2025-05-07T20:32:22.0727911Z @given( 2025-05-07T20:32:22.0728030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0728168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0728278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0728393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0728505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0728574Z ) 2025-05-07T20:32:22.0728814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0728903Z def test_silu_mul_quant( 2025-05-07T20:32:22.0728976Z self, 2025-05-07T20:32:22.0729047Z T: int, 2025-05-07T20:32:22.0729120Z D: int, 2025-05-07T20:32:22.0729212Z scale_ub: Optional[float], 2025-05-07T20:32:22.0729295Z contiguous: bool, 2025-05-07T20:32:22.0729378Z compiled: bool, 2025-05-07T20:32:22.0729448Z ) -> None: 2025-05-07T20:32:22.0729540Z torch.manual_seed(2025) 2025-05-07T20:32:22.0729613Z 2025-05-07T20:32:22.0729783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0729899Z 2025-05-07T20:32:22.0729986Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0730105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0730192Z x = x_sign * x_clamp 2025-05-07T20:32:22.0730267Z x0 = x[:, :D] 2025-05-07T20:32:22.0730340Z x1 = x[:, D:] 2025-05-07T20:32:22.0730411Z 2025-05-07T20:32:22.0730491Z if contiguous: 2025-05-07T20:32:22.0730576Z x0 = x0.contiguous() 2025-05-07T20:32:22.0730662Z x1 = x1.contiguous() 2025-05-07T20:32:22.0730730Z 2025-05-07T20:32:22.0730814Z if scale_ub is not None: 2025-05-07T20:32:22.0730917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0731046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0731121Z ) 2025-05-07T20:32:22.0731194Z else: 2025-05-07T20:32:22.0731284Z scale_ub_tensor = None 2025-05-07T20:32:22.0731360Z 2025-05-07T20:32:22.0731486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0731571Z op = silu_mul_quant 2025-05-07T20:32:22.0731658Z if compiled: 2025-05-07T20:32:22.0731752Z op = torch.compile(op) 2025-05-07T20:32:22.0731850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0731923Z 2025-05-07T20:32:22.0732009Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0732014Z 2025-05-07T20:32:22.0732110Z moe/activation_test.py:117: 2025-05-07T20:32:22.0732234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0732329Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0732428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0732797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0732888Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0733432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0733527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0733887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0734106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0734446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0734538Z kernel = self.compile( 2025-05-07T20:32:22.0734919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0735087Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0735255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0735301Z 2025-05-07T20:32:22.0735506Z self = 2025-05-07T20:32:22.0736274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0736764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afc720>} 2025-05-07T20:32:22.0737516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0737699Z context = 2025-05-07T20:32:22.0737706Z 2025-05-07T20:32:22.0737869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0738174Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0738277Z module_map=module_map) 2025-05-07T20:32:22.0738670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0738818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0738899Z E ^ 2025-05-07T20:32:22.0739256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0739261Z 2025-05-07T20:32:22.0739672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0739677Z 2025-05-07T20:32:22.0739773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0740002Z self=, 2025-05-07T20:32:22.0740073Z T=4096, 2025-05-07T20:32:22.0740155Z D=5120, 2025-05-07T20:32:22.0740238Z scale_ub=1200.0, 2025-05-07T20:32:22.0740314Z contiguous=True, 2025-05-07T20:32:22.0740394Z compiled=True, 2025-05-07T20:32:22.0740465Z ) 2025-05-07T20:32:22.0740678Z self = 2025-05-07T20:32:22.0740847Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0740852Z 2025-05-07T20:32:22.0740923Z @given( 2025-05-07T20:32:22.0741040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0741136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0741245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0741362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0741470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0741539Z ) 2025-05-07T20:32:22.0741785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0742000Z def test_silu_mul_quant( 2025-05-07T20:32:22.0742073Z self, 2025-05-07T20:32:22.0742152Z T: int, 2025-05-07T20:32:22.0742226Z D: int, 2025-05-07T20:32:22.0742318Z scale_ub: Optional[float], 2025-05-07T20:32:22.0742408Z contiguous: bool, 2025-05-07T20:32:22.0742487Z compiled: bool, 2025-05-07T20:32:22.0742560Z ) -> None: 2025-05-07T20:32:22.0742655Z torch.manual_seed(2025) 2025-05-07T20:32:22.0742724Z 2025-05-07T20:32:22.0742890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0742961Z 2025-05-07T20:32:22.0743048Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0743170Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0743255Z x = x_sign * x_clamp 2025-05-07T20:32:22.0743328Z x0 = x[:, :D] 2025-05-07T20:32:22.0743404Z x1 = x[:, D:] 2025-05-07T20:32:22.0743532Z 2025-05-07T20:32:22.0743612Z if contiguous: 2025-05-07T20:32:22.0743763Z x0 = x0.contiguous() 2025-05-07T20:32:22.0743849Z x1 = x1.contiguous() 2025-05-07T20:32:22.0743917Z 2025-05-07T20:32:22.0744005Z if scale_ub is not None: 2025-05-07T20:32:22.0744104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0744234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0744305Z ) 2025-05-07T20:32:22.0744374Z else: 2025-05-07T20:32:22.0744469Z scale_ub_tensor = None 2025-05-07T20:32:22.0744536Z 2025-05-07T20:32:22.0744659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0744748Z op = silu_mul_quant 2025-05-07T20:32:22.0744829Z if compiled: 2025-05-07T20:32:22.0744923Z op = torch.compile(op) 2025-05-07T20:32:22.0745028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0745096Z 2025-05-07T20:32:22.0745183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0745191Z 2025-05-07T20:32:22.0745354Z moe/activation_test.py:117: 2025-05-07T20:32:22.0745478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0745578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0745671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0746039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0746131Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0746623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0746714Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0747073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0747292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0747640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0747728Z kernel = self.compile( 2025-05-07T20:32:22.0748109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0748283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0748405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0748409Z 2025-05-07T20:32:22.0748613Z self = 2025-05-07T20:32:22.0749374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0749906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afd260>} 2025-05-07T20:32:22.0750661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0750847Z context = 2025-05-07T20:32:22.0750852Z 2025-05-07T20:32:22.0751014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0751273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0751371Z module_map=module_map) 2025-05-07T20:32:22.0751530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0751624Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0751737Z E ^ 2025-05-07T20:32:22.0752096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0752140Z 2025-05-07T20:32:22.0752550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0752555Z 2025-05-07T20:32:22.0752655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0752872Z self=, 2025-05-07T20:32:22.0752946Z T=128, 2025-05-07T20:32:22.0753022Z D=5120, 2025-05-07T20:32:22.0753101Z scale_ub=1200.0, 2025-05-07T20:32:22.0753187Z contiguous=False, 2025-05-07T20:32:22.0753266Z compiled=True, 2025-05-07T20:32:22.0753337Z ) 2025-05-07T20:32:22.0753557Z self = 2025-05-07T20:32:22.0753729Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0753734Z 2025-05-07T20:32:22.0753812Z @given( 2025-05-07T20:32:22.0753984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0754079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0754191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0754312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0754422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0754497Z ) 2025-05-07T20:32:22.0754745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0754835Z def test_silu_mul_quant( 2025-05-07T20:32:22.0754915Z self, 2025-05-07T20:32:22.0754991Z T: int, 2025-05-07T20:32:22.0755067Z D: int, 2025-05-07T20:32:22.0755170Z scale_ub: Optional[float], 2025-05-07T20:32:22.0755257Z contiguous: bool, 2025-05-07T20:32:22.0755338Z compiled: bool, 2025-05-07T20:32:22.0755421Z ) -> None: 2025-05-07T20:32:22.0755515Z torch.manual_seed(2025) 2025-05-07T20:32:22.0755594Z 2025-05-07T20:32:22.0755767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0755844Z 2025-05-07T20:32:22.0755940Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0756067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0756156Z x = x_sign * x_clamp 2025-05-07T20:32:22.0756239Z x0 = x[:, :D] 2025-05-07T20:32:22.0756317Z x1 = x[:, D:] 2025-05-07T20:32:22.0756388Z 2025-05-07T20:32:22.0756474Z if contiguous: 2025-05-07T20:32:22.0756561Z x0 = x0.contiguous() 2025-05-07T20:32:22.0756651Z x1 = x1.contiguous() 2025-05-07T20:32:22.0756724Z 2025-05-07T20:32:22.0756812Z if scale_ub is not None: 2025-05-07T20:32:22.0756914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0757053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0757132Z ) 2025-05-07T20:32:22.0757207Z else: 2025-05-07T20:32:22.0757358Z scale_ub_tensor = None 2025-05-07T20:32:22.0757430Z 2025-05-07T20:32:22.0757564Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0757651Z op = silu_mul_quant 2025-05-07T20:32:22.0757734Z if compiled: 2025-05-07T20:32:22.0757834Z op = torch.compile(op) 2025-05-07T20:32:22.0757940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0758010Z 2025-05-07T20:32:22.0758101Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0758106Z 2025-05-07T20:32:22.0758202Z moe/activation_test.py:117: 2025-05-07T20:32:22.0758329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0758433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0758529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0758947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0759077Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0759575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0759673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0760032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0760253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0760602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0760693Z kernel = self.compile( 2025-05-07T20:32:22.0761083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0761257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0761386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0761433Z 2025-05-07T20:32:22.0761646Z self = 2025-05-07T20:32:22.0762413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0762913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1afe480>} 2025-05-07T20:32:22.0763766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0763967Z context = 2025-05-07T20:32:22.0763975Z 2025-05-07T20:32:22.0764146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0764405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0764514Z module_map=module_map) 2025-05-07T20:32:22.0764673Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0764766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0764848Z E ^ 2025-05-07T20:32:22.0765201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0765206Z 2025-05-07T20:32:22.0765627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0765631Z 2025-05-07T20:32:22.0765732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0765998Z self=, 2025-05-07T20:32:22.0766087Z T=16384, 2025-05-07T20:32:22.0766161Z D=7168, 2025-05-07T20:32:22.0766240Z scale_ub=1200.0, 2025-05-07T20:32:22.0766327Z contiguous=True, 2025-05-07T20:32:22.0766409Z compiled=True, 2025-05-07T20:32:22.0766486Z ) 2025-05-07T20:32:22.0766701Z self = 2025-05-07T20:32:22.0766872Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0766877Z 2025-05-07T20:32:22.0766958Z @given( 2025-05-07T20:32:22.0767074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0767169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0767285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0767398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0767508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0767625Z ) 2025-05-07T20:32:22.0767872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0768010Z def test_silu_mul_quant( 2025-05-07T20:32:22.0768084Z self, 2025-05-07T20:32:22.0768157Z T: int, 2025-05-07T20:32:22.0768239Z D: int, 2025-05-07T20:32:22.0768334Z scale_ub: Optional[float], 2025-05-07T20:32:22.0768419Z contiguous: bool, 2025-05-07T20:32:22.0768509Z compiled: bool, 2025-05-07T20:32:22.0768584Z ) -> None: 2025-05-07T20:32:22.0771939Z torch.manual_seed(2025) 2025-05-07T20:32:22.0772021Z 2025-05-07T20:32:22.0772202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0772274Z 2025-05-07T20:32:22.0772364Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0772489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0772578Z x = x_sign * x_clamp 2025-05-07T20:32:22.0772657Z x0 = x[:, :D] 2025-05-07T20:32:22.0772737Z x1 = x[:, D:] 2025-05-07T20:32:22.0772905Z 2025-05-07T20:32:22.0772986Z if contiguous: 2025-05-07T20:32:22.0773080Z x0 = x0.contiguous() 2025-05-07T20:32:22.0773168Z x1 = x1.contiguous() 2025-05-07T20:32:22.0773238Z 2025-05-07T20:32:22.0773330Z if scale_ub is not None: 2025-05-07T20:32:22.0773431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0773567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0773640Z ) 2025-05-07T20:32:22.0773710Z else: 2025-05-07T20:32:22.0773805Z scale_ub_tensor = None 2025-05-07T20:32:22.0773874Z 2025-05-07T20:32:22.0774000Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0774090Z op = silu_mul_quant 2025-05-07T20:32:22.0774172Z if compiled: 2025-05-07T20:32:22.0774267Z op = torch.compile(op) 2025-05-07T20:32:22.0774376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0774447Z 2025-05-07T20:32:22.0774539Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0774547Z 2025-05-07T20:32:22.0774641Z moe/activation_test.py:117: 2025-05-07T20:32:22.0774768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0774868Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0774964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0775331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0775425Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0775916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0776012Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0776375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0776642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0776995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0777086Z kernel = self.compile( 2025-05-07T20:32:22.0777469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0777645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0777769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0777773Z 2025-05-07T20:32:22.0777977Z self = 2025-05-07T20:32:22.0778783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0779318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1affd80>} 2025-05-07T20:32:22.0780073Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0780260Z context = 2025-05-07T20:32:22.0780265Z 2025-05-07T20:32:22.0780432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0780691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0780794Z module_map=module_map) 2025-05-07T20:32:22.0780959Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0781057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0781141Z E ^ 2025-05-07T20:32:22.0781537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0781542Z 2025-05-07T20:32:22.0781954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0781959Z 2025-05-07T20:32:22.0782060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0782279Z self=, 2025-05-07T20:32:22.0782359Z T=16384, 2025-05-07T20:32:22.0782435Z D=5120, 2025-05-07T20:32:22.0782512Z scale_ub=1200.0, 2025-05-07T20:32:22.0782596Z contiguous=True, 2025-05-07T20:32:22.0782676Z compiled=False, 2025-05-07T20:32:22.0782753Z ) 2025-05-07T20:32:22.0782970Z self = 2025-05-07T20:32:22.0783150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0783159Z 2025-05-07T20:32:22.0783231Z @given( 2025-05-07T20:32:22.0783351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0783449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0783561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0783677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0783786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0783860Z ) 2025-05-07T20:32:22.0784104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0784193Z def test_silu_mul_quant( 2025-05-07T20:32:22.0784266Z self, 2025-05-07T20:32:22.0784338Z T: int, 2025-05-07T20:32:22.0784412Z D: int, 2025-05-07T20:32:22.0784508Z scale_ub: Optional[float], 2025-05-07T20:32:22.0784592Z contiguous: bool, 2025-05-07T20:32:22.0784674Z compiled: bool, 2025-05-07T20:32:22.0784758Z ) -> None: 2025-05-07T20:32:22.0784897Z torch.manual_seed(2025) 2025-05-07T20:32:22.0784969Z 2025-05-07T20:32:22.0785140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0785210Z 2025-05-07T20:32:22.0785299Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0785418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0785506Z x = x_sign * x_clamp 2025-05-07T20:32:22.0785583Z x0 = x[:, :D] 2025-05-07T20:32:22.0785658Z x1 = x[:, D:] 2025-05-07T20:32:22.0785727Z 2025-05-07T20:32:22.0785810Z if contiguous: 2025-05-07T20:32:22.0785898Z x0 = x0.contiguous() 2025-05-07T20:32:22.0785984Z x1 = x1.contiguous() 2025-05-07T20:32:22.0786058Z 2025-05-07T20:32:22.0786143Z if scale_ub is not None: 2025-05-07T20:32:22.0786246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0786423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0786537Z ) 2025-05-07T20:32:22.0786613Z else: 2025-05-07T20:32:22.0786704Z scale_ub_tensor = None 2025-05-07T20:32:22.0786772Z 2025-05-07T20:32:22.0786902Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0786990Z op = silu_mul_quant 2025-05-07T20:32:22.0787072Z if compiled: 2025-05-07T20:32:22.0787168Z op = torch.compile(op) 2025-05-07T20:32:22.0787270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0787337Z 2025-05-07T20:32:22.0787426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0787431Z 2025-05-07T20:32:22.0787525Z moe/activation_test.py:117: 2025-05-07T20:32:22.0787653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0787747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0787844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0788349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.0788487Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0788847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0789071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0789409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0789505Z kernel = self.compile( 2025-05-07T20:32:22.0789889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0790061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0790188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0790195Z 2025-05-07T20:32:22.0790401Z self = 2025-05-07T20:32:22.0791214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0791726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1774cc0>} 2025-05-07T20:32:22.0792473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0792667Z context = 2025-05-07T20:32:22.0792671Z 2025-05-07T20:32:22.0792835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0793141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0793249Z module_map=module_map) 2025-05-07T20:32:22.0793407Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0793506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0793582Z E ^ 2025-05-07T20:32:22.0793933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0793941Z 2025-05-07T20:32:22.0794353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0794358Z 2025-05-07T20:32:22.0794454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0794675Z self=, 2025-05-07T20:32:22.0794747Z T=1, 2025-05-07T20:32:22.0794861Z D=7168, 2025-05-07T20:32:22.0794986Z scale_ub=1200.0, 2025-05-07T20:32:22.0795073Z contiguous=False, 2025-05-07T20:32:22.0795156Z compiled=False, 2025-05-07T20:32:22.0795230Z ) 2025-05-07T20:32:22.0795446Z self = 2025-05-07T20:32:22.0795618Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0795622Z 2025-05-07T20:32:22.0795696Z @given( 2025-05-07T20:32:22.0795813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0795909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0796020Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0796132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0796248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0796317Z ) 2025-05-07T20:32:22.0796564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0796661Z def test_silu_mul_quant( 2025-05-07T20:32:22.0796782Z self, 2025-05-07T20:32:22.0796858Z T: int, 2025-05-07T20:32:22.0796932Z D: int, 2025-05-07T20:32:22.0797025Z scale_ub: Optional[float], 2025-05-07T20:32:22.0797112Z contiguous: bool, 2025-05-07T20:32:22.0797194Z compiled: bool, 2025-05-07T20:32:22.0797270Z ) -> None: 2025-05-07T20:32:22.0797361Z torch.manual_seed(2025) 2025-05-07T20:32:22.0797438Z 2025-05-07T20:32:22.0797607Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0797678Z 2025-05-07T20:32:22.0797769Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0797889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0797978Z x = x_sign * x_clamp 2025-05-07T20:32:22.0798053Z x0 = x[:, :D] 2025-05-07T20:32:22.0798128Z x1 = x[:, D:] 2025-05-07T20:32:22.0798202Z 2025-05-07T20:32:22.0798283Z if contiguous: 2025-05-07T20:32:22.0798372Z x0 = x0.contiguous() 2025-05-07T20:32:22.0798464Z x1 = x1.contiguous() 2025-05-07T20:32:22.0798533Z 2025-05-07T20:32:22.0798622Z if scale_ub is not None: 2025-05-07T20:32:22.0798726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0798855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0798929Z ) 2025-05-07T20:32:22.0799005Z else: 2025-05-07T20:32:22.0799095Z scale_ub_tensor = None 2025-05-07T20:32:22.0799160Z 2025-05-07T20:32:22.0799291Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0799377Z op = silu_mul_quant 2025-05-07T20:32:22.0799461Z if compiled: 2025-05-07T20:32:22.0799556Z op = torch.compile(op) 2025-05-07T20:32:22.0799656Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0799729Z 2025-05-07T20:32:22.0799816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0799823Z 2025-05-07T20:32:22.0799960Z moe/activation_test.py:117: 2025-05-07T20:32:22.0800094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0800191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0800286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0800786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0800884Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0801247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0801465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0801803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0801895Z kernel = self.compile( 2025-05-07T20:32:22.0802317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0802554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0802676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0802681Z 2025-05-07T20:32:22.0802884Z self = 2025-05-07T20:32:22.0803742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0804235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1775080>} 2025-05-07T20:32:22.0804992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0805221Z context = 2025-05-07T20:32:22.0805226Z 2025-05-07T20:32:22.0805387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0805649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0805750Z module_map=module_map) 2025-05-07T20:32:22.0805912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0806005Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0806077Z E ^ 2025-05-07T20:32:22.0806431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0806436Z 2025-05-07T20:32:22.0806854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0806865Z 2025-05-07T20:32:22.0806966Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0807186Z self=, 2025-05-07T20:32:22.0807259Z T=4096, 2025-05-07T20:32:22.0807334Z D=7168, 2025-05-07T20:32:22.0807412Z scale_ub=1200.0, 2025-05-07T20:32:22.0807491Z contiguous=False, 2025-05-07T20:32:22.0807568Z compiled=True, 2025-05-07T20:32:22.0807635Z ) 2025-05-07T20:32:22.0807848Z self = 2025-05-07T20:32:22.0808020Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0808025Z 2025-05-07T20:32:22.0808097Z @given( 2025-05-07T20:32:22.0808217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0808311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0808425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0808584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0808696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0808768Z ) 2025-05-07T20:32:22.0809013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0809101Z def test_silu_mul_quant( 2025-05-07T20:32:22.0809170Z self, 2025-05-07T20:32:22.0809247Z T: int, 2025-05-07T20:32:22.0809317Z D: int, 2025-05-07T20:32:22.0809411Z scale_ub: Optional[float], 2025-05-07T20:32:22.0809495Z contiguous: bool, 2025-05-07T20:32:22.0809574Z compiled: bool, 2025-05-07T20:32:22.0809648Z ) -> None: 2025-05-07T20:32:22.0809736Z torch.manual_seed(2025) 2025-05-07T20:32:22.0809803Z 2025-05-07T20:32:22.0809971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0810042Z 2025-05-07T20:32:22.0810172Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0810339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0810426Z x = x_sign * x_clamp 2025-05-07T20:32:22.0810500Z x0 = x[:, :D] 2025-05-07T20:32:22.0810578Z x1 = x[:, D:] 2025-05-07T20:32:22.0810646Z 2025-05-07T20:32:22.0810723Z if contiguous: 2025-05-07T20:32:22.0810810Z x0 = x0.contiguous() 2025-05-07T20:32:22.0810894Z x1 = x1.contiguous() 2025-05-07T20:32:22.0810965Z 2025-05-07T20:32:22.0811051Z if scale_ub is not None: 2025-05-07T20:32:22.0811150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0811283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0811353Z ) 2025-05-07T20:32:22.0811425Z else: 2025-05-07T20:32:22.0811518Z scale_ub_tensor = None 2025-05-07T20:32:22.0811585Z 2025-05-07T20:32:22.0811711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0811802Z op = silu_mul_quant 2025-05-07T20:32:22.0811887Z if compiled: 2025-05-07T20:32:22.0812097Z op = torch.compile(op) 2025-05-07T20:32:22.0812197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0812265Z 2025-05-07T20:32:22.0812356Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0812360Z 2025-05-07T20:32:22.0812452Z moe/activation_test.py:117: 2025-05-07T20:32:22.0812577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0812674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0812770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0813136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0813227Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0813726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0813825Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0814188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0814407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0814748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0814835Z kernel = self.compile( 2025-05-07T20:32:22.0815220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0815395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0815517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0815521Z 2025-05-07T20:32:22.0815726Z self = 2025-05-07T20:32:22.0816540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0817039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1777060>} 2025-05-07T20:32:22.0817790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0817977Z context = 2025-05-07T20:32:22.0817981Z 2025-05-07T20:32:22.0818145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0818440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0818586Z module_map=module_map) 2025-05-07T20:32:22.0818745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0818838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0818913Z E ^ 2025-05-07T20:32:22.0819263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0819268Z 2025-05-07T20:32:22.0819678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0819683Z 2025-05-07T20:32:22.0819785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0820001Z self=, 2025-05-07T20:32:22.0820077Z T=128, 2025-05-07T20:32:22.0820150Z D=7168, 2025-05-07T20:32:22.0820225Z scale_ub=1200.0, 2025-05-07T20:32:22.0820311Z contiguous=False, 2025-05-07T20:32:22.0820388Z compiled=True, 2025-05-07T20:32:22.0820465Z ) 2025-05-07T20:32:22.0820747Z self = 2025-05-07T20:32:22.0820937Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0820942Z 2025-05-07T20:32:22.0821015Z @given( 2025-05-07T20:32:22.0821131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0821223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0821338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0821451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0821559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0821629Z ) 2025-05-07T20:32:22.0821868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0821958Z def test_silu_mul_quant( 2025-05-07T20:32:22.0822033Z self, 2025-05-07T20:32:22.0822107Z T: int, 2025-05-07T20:32:22.0822182Z D: int, 2025-05-07T20:32:22.0822282Z scale_ub: Optional[float], 2025-05-07T20:32:22.0822367Z contiguous: bool, 2025-05-07T20:32:22.0822446Z compiled: bool, 2025-05-07T20:32:22.0822523Z ) -> None: 2025-05-07T20:32:22.0822612Z torch.manual_seed(2025) 2025-05-07T20:32:22.0822685Z 2025-05-07T20:32:22.0822849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0822919Z 2025-05-07T20:32:22.0823006Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0823125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0823206Z x = x_sign * x_clamp 2025-05-07T20:32:22.0823287Z x0 = x[:, :D] 2025-05-07T20:32:22.0823360Z x1 = x[:, D:] 2025-05-07T20:32:22.0823428Z 2025-05-07T20:32:22.0823510Z if contiguous: 2025-05-07T20:32:22.0823595Z x0 = x0.contiguous() 2025-05-07T20:32:22.0823683Z x1 = x1.contiguous() 2025-05-07T20:32:22.0823751Z 2025-05-07T20:32:22.0823884Z if scale_ub is not None: 2025-05-07T20:32:22.0823993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0824123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0824193Z ) 2025-05-07T20:32:22.0824267Z else: 2025-05-07T20:32:22.0824356Z scale_ub_tensor = None 2025-05-07T20:32:22.0824423Z 2025-05-07T20:32:22.0824551Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0824637Z op = silu_mul_quant 2025-05-07T20:32:22.0824715Z if compiled: 2025-05-07T20:32:22.0824815Z op = torch.compile(op) 2025-05-07T20:32:22.0824915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0824983Z 2025-05-07T20:32:22.0825072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0825076Z 2025-05-07T20:32:22.0825167Z moe/activation_test.py:117: 2025-05-07T20:32:22.0825333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0825475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0825570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0825940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0826028Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0826522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0826617Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0826979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0827202Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0827541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0827629Z kernel = self.compile( 2025-05-07T20:32:22.0828019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0828233Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0828357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0828365Z 2025-05-07T20:32:22.0828568Z self = 2025-05-07T20:32:22.0829335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0829834Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f188c360>} 2025-05-07T20:32:22.0830583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0830778Z context = 2025-05-07T20:32:22.0830782Z 2025-05-07T20:32:22.0830947Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0831204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0831308Z module_map=module_map) 2025-05-07T20:32:22.0831468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0831561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0831636Z E ^ 2025-05-07T20:32:22.0831988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0831995Z 2025-05-07T20:32:22.0832451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0832460Z 2025-05-07T20:32:22.0832559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0832778Z self=, 2025-05-07T20:32:22.0832854Z T=2048, 2025-05-07T20:32:22.0832927Z D=7168, 2025-05-07T20:32:22.0833003Z scale_ub=None, 2025-05-07T20:32:22.0833083Z contiguous=True, 2025-05-07T20:32:22.0833156Z compiled=True, 2025-05-07T20:32:22.0833224Z ) 2025-05-07T20:32:22.0833448Z self = 2025-05-07T20:32:22.0833612Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0833617Z 2025-05-07T20:32:22.0833693Z @given( 2025-05-07T20:32:22.0833807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0833966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0834123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0834236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0834344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0834415Z ) 2025-05-07T20:32:22.0834655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0834744Z def test_silu_mul_quant( 2025-05-07T20:32:22.0834820Z self, 2025-05-07T20:32:22.0834893Z T: int, 2025-05-07T20:32:22.0834971Z D: int, 2025-05-07T20:32:22.0835064Z scale_ub: Optional[float], 2025-05-07T20:32:22.0835147Z contiguous: bool, 2025-05-07T20:32:22.0835229Z compiled: bool, 2025-05-07T20:32:22.0835302Z ) -> None: 2025-05-07T20:32:22.0835392Z torch.manual_seed(2025) 2025-05-07T20:32:22.0835464Z 2025-05-07T20:32:22.0835632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0835702Z 2025-05-07T20:32:22.0835798Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0835961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0836043Z x = x_sign * x_clamp 2025-05-07T20:32:22.0836121Z x0 = x[:, :D] 2025-05-07T20:32:22.0836197Z x1 = x[:, D:] 2025-05-07T20:32:22.0836265Z 2025-05-07T20:32:22.0836346Z if contiguous: 2025-05-07T20:32:22.0836430Z x0 = x0.contiguous() 2025-05-07T20:32:22.0836514Z x1 = x1.contiguous() 2025-05-07T20:32:22.0836579Z 2025-05-07T20:32:22.0836663Z if scale_ub is not None: 2025-05-07T20:32:22.0836767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0836894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0836965Z ) 2025-05-07T20:32:22.0837040Z else: 2025-05-07T20:32:22.0837129Z scale_ub_tensor = None 2025-05-07T20:32:22.0837196Z 2025-05-07T20:32:22.0837326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0837418Z op = silu_mul_quant 2025-05-07T20:32:22.0837498Z if compiled: 2025-05-07T20:32:22.0837601Z op = torch.compile(op) 2025-05-07T20:32:22.0837701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0837772Z 2025-05-07T20:32:22.0837859Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0837863Z 2025-05-07T20:32:22.0837954Z moe/activation_test.py:117: 2025-05-07T20:32:22.0838081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0838177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0838270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0838915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0839009Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0839601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0839704Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0840062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0840284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0840622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0840710Z kernel = self.compile( 2025-05-07T20:32:22.0841093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0841264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0841388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0841393Z 2025-05-07T20:32:22.0841654Z self = 2025-05-07T20:32:22.0842477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0842971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f188cea0>} 2025-05-07T20:32:22.0843799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0843988Z context = 2025-05-07T20:32:22.0843993Z 2025-05-07T20:32:22.0844152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0844417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0844586Z module_map=module_map) 2025-05-07T20:32:22.0844743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0844840Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0844913Z E ^ 2025-05-07T20:32:22.0845262Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0845267Z 2025-05-07T20:32:22.0845681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0845686Z 2025-05-07T20:32:22.0845783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0846007Z self=, 2025-05-07T20:32:22.0846081Z T=16384, 2025-05-07T20:32:22.0846154Z D=5120, 2025-05-07T20:32:22.0846238Z scale_ub=None, 2025-05-07T20:32:22.0846319Z contiguous=False, 2025-05-07T20:32:22.0846405Z compiled=False, 2025-05-07T20:32:22.0846477Z ) 2025-05-07T20:32:22.0846689Z self = 2025-05-07T20:32:22.0846859Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0846863Z 2025-05-07T20:32:22.0846938Z @given( 2025-05-07T20:32:22.0847051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0847143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0847251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0847361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0847470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0847539Z ) 2025-05-07T20:32:22.0847782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0847876Z def test_silu_mul_quant( 2025-05-07T20:32:22.0847946Z self, 2025-05-07T20:32:22.0848071Z T: int, 2025-05-07T20:32:22.0848149Z D: int, 2025-05-07T20:32:22.0848241Z scale_ub: Optional[float], 2025-05-07T20:32:22.0848326Z contiguous: bool, 2025-05-07T20:32:22.0848407Z compiled: bool, 2025-05-07T20:32:22.0848481Z ) -> None: 2025-05-07T20:32:22.0848573Z torch.manual_seed(2025) 2025-05-07T20:32:22.0848643Z 2025-05-07T20:32:22.0848808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0848877Z 2025-05-07T20:32:22.0848963Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0849081Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0850933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
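
The CompilationError repeated above is an architecture mismatch rather than a bug in the test body: Triton's fp8e4nv type (the NVIDIA e4m3 float8 variant) is only lowered on GPUs of compute capability 8.9 or newer, and the error's supported list ('fp8e4b15', 'fp8e5') is exactly the pre-8.9 set, so the GPU in this job cannot run this kernel at all. A minimal guard along these lines would skip rather than fail; the marker name is hypothetical and not code from this repository, but torch.cuda.get_device_capability() is the standard PyTorch API for this check:

import pytest
import torch

# Skip FP8 kernels on GPUs older than SM 8.9, where Triton cannot lower
# fp8e4nv and raises exactly the ValueError seen in this log.
requires_sm89 = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (float8 e4m3) requires compute capability >= 8.9",
)
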
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0850976Z 2025-05-07T20:32:22.0851092Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0851097Z 2025-05-07T20:32:22.0851197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0851415Z self=, 2025-05-07T20:32:22.0851491Z T=4096, 2025-05-07T20:32:22.0851564Z D=7168, 2025-05-07T20:32:22.0851641Z scale_ub=1200.0, 2025-05-07T20:32:22.0851723Z contiguous=True, 2025-05-07T20:32:22.0851799Z compiled=True, 2025-05-07T20:32:22.0851868Z ) 2025-05-07T20:32:22.0852083Z self = 2025-05-07T20:32:22.0852252Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0852305Z 2025-05-07T20:32:22.0852379Z @given( 2025-05-07T20:32:22.0852493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0852586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0852693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0852804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0852910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0852982Z ) 2025-05-07T20:32:22.0853220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0853308Z def test_silu_mul_quant( 2025-05-07T20:32:22.0853385Z self, 2025-05-07T20:32:22.0853459Z T: int, 2025-05-07T20:32:22.0853528Z D: int, 2025-05-07T20:32:22.0853622Z scale_ub: Optional[float], 2025-05-07T20:32:22.0853705Z contiguous: bool, 2025-05-07T20:32:22.0853788Z compiled: bool, 2025-05-07T20:32:22.0853865Z ) -> None: 2025-05-07T20:32:22.0853962Z torch.manual_seed(2025) 2025-05-07T20:32:22.0854030Z 2025-05-07T20:32:22.0854198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0854266Z 2025-05-07T20:32:22.0854356Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0854476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0856258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0856267Z 2025-05-07T20:32:22.0856424Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0856432Z 2025-05-07T20:32:22.0856530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0856752Z self=, 2025-05-07T20:32:22.0856825Z T=16384, 2025-05-07T20:32:22.0856898Z D=7168, 2025-05-07T20:32:22.0856977Z scale_ub=None, 2025-05-07T20:32:22.0857060Z contiguous=False, 2025-05-07T20:32:22.0857138Z compiled=False, 2025-05-07T20:32:22.0857211Z ) 2025-05-07T20:32:22.0857420Z self = 2025-05-07T20:32:22.0857594Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0857598Z 2025-05-07T20:32:22.0857673Z @given( 2025-05-07T20:32:22.0857787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0857921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0858033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0858182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0858292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0858361Z ) 2025-05-07T20:32:22.0858604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0858692Z def test_silu_mul_quant( 2025-05-07T20:32:22.0858765Z self, 2025-05-07T20:32:22.0858839Z T: int, 2025-05-07T20:32:22.0858912Z D: int, 2025-05-07T20:32:22.0859001Z scale_ub: Optional[float], 2025-05-07T20:32:22.0859086Z contiguous: bool, 2025-05-07T20:32:22.0859167Z compiled: bool, 2025-05-07T20:32:22.0859239Z ) -> None: 2025-05-07T20:32:22.0859328Z torch.manual_seed(2025) 2025-05-07T20:32:22.0859396Z 2025-05-07T20:32:22.0859559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0861348Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0861397Z 2025-05-07T20:32:22.0861512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0861519Z 2025-05-07T20:32:22.0861620Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0861838Z self=, 2025-05-07T20:32:22.0861914Z T=2048, 2025-05-07T20:32:22.0861987Z D=7168, 2025-05-07T20:32:22.0862066Z scale_ub=1200.0, 2025-05-07T20:32:22.0862150Z contiguous=True, 2025-05-07T20:32:22.0862234Z compiled=True, 2025-05-07T20:32:22.0862308Z ) 2025-05-07T20:32:22.0862523Z self = 2025-05-07T20:32:22.0862687Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.0862692Z 2025-05-07T20:32:22.0862765Z @given( 2025-05-07T20:32:22.0862884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0862979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0863091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0863201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0863307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0863381Z ) 2025-05-07T20:32:22.0863619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0863706Z def test_silu_mul_quant( 2025-05-07T20:32:22.0863784Z self, 2025-05-07T20:32:22.0863858Z T: int, 2025-05-07T20:32:22.0863998Z D: int, 2025-05-07T20:32:22.0864094Z scale_ub: Optional[float], 2025-05-07T20:32:22.0864176Z contiguous: bool, 2025-05-07T20:32:22.0864258Z compiled: bool, 2025-05-07T20:32:22.0864329Z ) -> None: 2025-05-07T20:32:22.0864418Z torch.manual_seed(2025) 2025-05-07T20:32:22.0864486Z 2025-05-07T20:32:22.0864651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0864721Z 2025-05-07T20:32:22.0864808Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0864927Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0866730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
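
The OOM messages themselves suggest PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. One detail worth noting: the caching allocator reads that variable when it first initializes, so it has to be in the process environment before the first CUDA allocation, not set from inside a test. A sketch, assuming the suite's entry point can be edited:

import os

# Must be set before torch performs its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported afterwards; the first CUDA use picks the setting up

That said, with only ~28 MiB free against 21.6+ GiB already allocated, the failures here look driven by memory accumulating across examples rather than by fragmentation alone, so this hint may not be sufficient on its own.
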
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0866778Z 2025-05-07T20:32:22.0866889Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.0866894Z 2025-05-07T20:32:22.0866990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0867208Z self=, 2025-05-07T20:32:22.0867280Z T=2048, 2025-05-07T20:32:22.0867356Z D=7168, 2025-05-07T20:32:22.0867433Z scale_ub=None, 2025-05-07T20:32:22.0867514Z contiguous=True, 2025-05-07T20:32:22.0867598Z compiled=False, 2025-05-07T20:32:22.0867668Z ) 2025-05-07T20:32:22.0867877Z self = 2025-05-07T20:32:22.0868049Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0868056Z 2025-05-07T20:32:22.0868175Z @given( 2025-05-07T20:32:22.0868287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0868383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0868490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0868599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0868709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0868778Z ) 2025-05-07T20:32:22.0869019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0869107Z def test_silu_mul_quant( 2025-05-07T20:32:22.0869181Z self, 2025-05-07T20:32:22.0869254Z T: int, 2025-05-07T20:32:22.0869325Z D: int, 2025-05-07T20:32:22.0869417Z scale_ub: Optional[float], 2025-05-07T20:32:22.0869504Z contiguous: bool, 2025-05-07T20:32:22.0869584Z compiled: bool, 2025-05-07T20:32:22.0869658Z ) -> None: 2025-05-07T20:32:22.0869755Z torch.manual_seed(2025) 2025-05-07T20:32:22.0869826Z 2025-05-07T20:32:22.0869989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0870061Z 2025-05-07T20:32:22.0870147Z > x_sign = torch.sign(x) 2025-05-07T20:32:22.0871911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0871918Z 2025-05-07T20:32:22.0872032Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:22.0872036Z 2025-05-07T20:32:22.0872182Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0872403Z self=, 2025-05-07T20:32:22.0872475Z T=1, 2025-05-07T20:32:22.0872550Z D=7168, 2025-05-07T20:32:22.0872628Z scale_ub=1200.0, 2025-05-07T20:32:22.0872709Z contiguous=True, 2025-05-07T20:32:22.0872790Z compiled=False, 2025-05-07T20:32:22.0872860Z ) 2025-05-07T20:32:22.0873070Z self = 2025-05-07T20:32:22.0873233Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0873237Z 2025-05-07T20:32:22.0873310Z @given( 2025-05-07T20:32:22.0873424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0873518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0873625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0873780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0873929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0874003Z ) 2025-05-07T20:32:22.0874244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0874332Z def test_silu_mul_quant( 2025-05-07T20:32:22.0874407Z self, 2025-05-07T20:32:22.0874479Z T: int, 2025-05-07T20:32:22.0874549Z D: int, 2025-05-07T20:32:22.0874651Z scale_ub: Optional[float], 2025-05-07T20:32:22.0874734Z contiguous: bool, 2025-05-07T20:32:22.0874812Z compiled: bool, 2025-05-07T20:32:22.0874889Z ) -> None: 2025-05-07T20:32:22.0874979Z torch.manual_seed(2025) 2025-05-07T20:32:22.0875048Z 2025-05-07T20:32:22.0875217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0875289Z 2025-05-07T20:32:22.0875375Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0875501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0875590Z x = x_sign * x_clamp 2025-05-07T20:32:22.0875711Z x0 = x[:, :D] 2025-05-07T20:32:22.0875787Z x1 = x[:, D:] 2025-05-07T20:32:22.0875855Z 2025-05-07T20:32:22.0875939Z if contiguous: 2025-05-07T20:32:22.0876027Z x0 = x0.contiguous() 2025-05-07T20:32:22.0876115Z x1 = x1.contiguous() 2025-05-07T20:32:22.0876190Z 2025-05-07T20:32:22.0876277Z if scale_ub is not None: 2025-05-07T20:32:22.0876375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0876509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0876581Z ) 2025-05-07T20:32:22.0876652Z else: 2025-05-07T20:32:22.0876743Z scale_ub_tensor = None 2025-05-07T20:32:22.0876811Z 2025-05-07T20:32:22.0876936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0877025Z op = silu_mul_quant 2025-05-07T20:32:22.0877107Z if compiled: 2025-05-07T20:32:22.0877205Z op = torch.compile(op) 2025-05-07T20:32:22.0877310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0877379Z 2025-05-07T20:32:22.0877466Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0877470Z 2025-05-07T20:32:22.0877563Z moe/activation_test.py:117: 2025-05-07T20:32:22.0877687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0877782Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0877877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0878378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0878473Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0878831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0879058Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0879445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0879538Z kernel = self.compile( 2025-05-07T20:32:22.0879924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0880095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0880217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0880222Z 2025-05-07T20:32:22.0880423Z self = 2025-05-07T20:32:22.0881238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0881782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1644680>} 2025-05-07T20:32:22.0882573Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0882763Z context = 2025-05-07T20:32:22.0882768Z 2025-05-07T20:32:22.0882928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0883188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0883291Z module_map=module_map) 2025-05-07T20:32:22.0883547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0883643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0883716Z E ^ 2025-05-07T20:32:22.0884070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0884124Z 2025-05-07T20:32:22.0884542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0884546Z 2025-05-07T20:32:22.0884647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0884869Z self=, 2025-05-07T20:32:22.0884943Z T=128, 2025-05-07T20:32:22.0885013Z D=5120, 2025-05-07T20:32:22.0885092Z scale_ub=None, 2025-05-07T20:32:22.0885170Z contiguous=True, 2025-05-07T20:32:22.0885249Z compiled=False, 2025-05-07T20:32:22.0885322Z ) 2025-05-07T20:32:22.0885535Z self = 2025-05-07T20:32:22.0885706Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0885714Z 2025-05-07T20:32:22.0885789Z @given( 2025-05-07T20:32:22.0885911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0886002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0886116Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0886229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0886337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0886408Z ) 2025-05-07T20:32:22.0886648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0886739Z def test_silu_mul_quant( 2025-05-07T20:32:22.0886815Z self, 2025-05-07T20:32:22.0886886Z T: int, 2025-05-07T20:32:22.0886961Z D: int, 2025-05-07T20:32:22.0887053Z scale_ub: Optional[float], 2025-05-07T20:32:22.0887136Z contiguous: bool, 2025-05-07T20:32:22.0887219Z compiled: bool, 2025-05-07T20:32:22.0887292Z ) -> None: 2025-05-07T20:32:22.0887386Z torch.manual_seed(2025) 2025-05-07T20:32:22.0887509Z 2025-05-07T20:32:22.0887678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0887751Z 2025-05-07T20:32:22.0887836Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0887956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0888043Z x = x_sign * x_clamp 2025-05-07T20:32:22.0888118Z x0 = x[:, :D] 2025-05-07T20:32:22.0888191Z x1 = x[:, D:] 2025-05-07T20:32:22.0888258Z 2025-05-07T20:32:22.0888337Z if contiguous: 2025-05-07T20:32:22.0891650Z x0 = x0.contiguous() 2025-05-07T20:32:22.0891750Z x1 = x1.contiguous() 2025-05-07T20:32:22.0891824Z 2025-05-07T20:32:22.0891914Z if scale_ub is not None: 2025-05-07T20:32:22.0892017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0892153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0892295Z ) 2025-05-07T20:32:22.0892369Z else: 2025-05-07T20:32:22.0892506Z scale_ub_tensor = None 2025-05-07T20:32:22.0892577Z 2025-05-07T20:32:22.0892706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0892796Z op = silu_mul_quant 2025-05-07T20:32:22.0892881Z if compiled: 2025-05-07T20:32:22.0892983Z op = torch.compile(op) 2025-05-07T20:32:22.0893085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0893155Z 2025-05-07T20:32:22.0893249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0893254Z 2025-05-07T20:32:22.0893348Z moe/activation_test.py:117: 2025-05-07T20:32:22.0893476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0893579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0893676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0894178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0894351Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0894709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0894933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0895274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0895365Z kernel = self.compile( 2025-05-07T20:32:22.0895754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0895924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0896053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0896058Z 2025-05-07T20:32:22.0896261Z self = 2025-05-07T20:32:22.0897035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0897534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16458a0>} 2025-05-07T20:32:22.0898279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0898470Z context = 2025-05-07T20:32:22.0898475Z 2025-05-07T20:32:22.0898636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0898945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0899056Z module_map=module_map) 2025-05-07T20:32:22.0899217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0899315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0899388Z E ^ 2025-05-07T20:32:22.0899740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0899745Z 2025-05-07T20:32:22.0900157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0900162Z 2025-05-07T20:32:22.0900260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0900480Z self=, 2025-05-07T20:32:22.0900556Z T=128, 2025-05-07T20:32:22.0900630Z D=7168, 2025-05-07T20:32:22.0900752Z scale_ub=None, 2025-05-07T20:32:22.0900837Z contiguous=True, 2025-05-07T20:32:22.0900960Z compiled=False, 2025-05-07T20:32:22.0901037Z ) 2025-05-07T20:32:22.0901249Z self = 2025-05-07T20:32:22.0901415Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0901419Z 2025-05-07T20:32:22.0901494Z @given( 2025-05-07T20:32:22.0901611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0901708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0901818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0901929Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0902041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0902109Z ) 2025-05-07T20:32:22.0902349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0902441Z def test_silu_mul_quant( 2025-05-07T20:32:22.0902517Z self, 2025-05-07T20:32:22.0902590Z T: int, 2025-05-07T20:32:22.0902718Z D: int, 2025-05-07T20:32:22.0902813Z scale_ub: Optional[float], 2025-05-07T20:32:22.0902897Z contiguous: bool, 2025-05-07T20:32:22.0902983Z compiled: bool, 2025-05-07T20:32:22.0903058Z ) -> None: 2025-05-07T20:32:22.0903154Z torch.manual_seed(2025) 2025-05-07T20:32:22.0903225Z 2025-05-07T20:32:22.0903391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0903464Z 2025-05-07T20:32:22.0903553Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0903673Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0903761Z x = x_sign * x_clamp 2025-05-07T20:32:22.0903836Z x0 = x[:, :D] 2025-05-07T20:32:22.0903909Z x1 = x[:, D:] 2025-05-07T20:32:22.0903983Z 2025-05-07T20:32:22.0904061Z if contiguous: 2025-05-07T20:32:22.0904153Z x0 = x0.contiguous() 2025-05-07T20:32:22.0904247Z x1 = x1.contiguous() 2025-05-07T20:32:22.0904321Z 2025-05-07T20:32:22.0904412Z if scale_ub is not None: 2025-05-07T20:32:22.0904519Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0904648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0904723Z ) 2025-05-07T20:32:22.0904797Z else: 2025-05-07T20:32:22.0904888Z scale_ub_tensor = None 2025-05-07T20:32:22.0904961Z 2025-05-07T20:32:22.0905086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0905173Z op = silu_mul_quant 2025-05-07T20:32:22.0905260Z if compiled: 2025-05-07T20:32:22.0905355Z op = torch.compile(op) 2025-05-07T20:32:22.0905455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0905532Z 2025-05-07T20:32:22.0905619Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0905624Z 2025-05-07T20:32:22.0905722Z moe/activation_test.py:117: 2025-05-07T20:32:22.0905906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0906015Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0906111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0906609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0906700Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0907063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0907282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0907627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0907716Z kernel = self.compile( 2025-05-07T20:32:22.0908140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0908355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0908481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0908486Z 2025-05-07T20:32:22.0908686Z self = 2025-05-07T20:32:22.0909455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0909947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f16467a0>} 2025-05-07T20:32:22.0910697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0910927Z context = 2025-05-07T20:32:22.0910932Z 2025-05-07T20:32:22.0911097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0911358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0911459Z module_map=module_map) 2025-05-07T20:32:22.0911623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0911717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0911791Z E ^ 2025-05-07T20:32:22.0912147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0912151Z 2025-05-07T20:32:22.0912565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0912570Z 2025-05-07T20:32:22.0912676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0912897Z self=, 2025-05-07T20:32:22.0912971Z T=2048, 2025-05-07T20:32:22.0913044Z D=7168, 2025-05-07T20:32:22.0913121Z scale_ub=1200.0, 2025-05-07T20:32:22.0913199Z contiguous=True, 2025-05-07T20:32:22.0913282Z compiled=False, 2025-05-07T20:32:22.0913349Z ) 2025-05-07T20:32:22.0913566Z self = 2025-05-07T20:32:22.0913735Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0913740Z 2025-05-07T20:32:22.0913817Z @given( 2025-05-07T20:32:22.0913937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0914035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0914144Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0914262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0914417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0914493Z ) 2025-05-07T20:32:22.0914738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0914829Z def test_silu_mul_quant( 2025-05-07T20:32:22.0914905Z self, 2025-05-07T20:32:22.0914980Z T: int, 2025-05-07T20:32:22.0915056Z D: int, 2025-05-07T20:32:22.0915153Z scale_ub: Optional[float], 2025-05-07T20:32:22.0915239Z contiguous: bool, 2025-05-07T20:32:22.0915321Z compiled: bool, 2025-05-07T20:32:22.0915401Z ) -> None: 2025-05-07T20:32:22.0915491Z torch.manual_seed(2025) 2025-05-07T20:32:22.0915562Z 2025-05-07T20:32:22.0915733Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0917554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
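
For debugging, the CompilationError can likely be reproduced without Hypothesis by calling the op directly with one of the small logged examples (T=128, D=5120, scale_ub=None, contiguous=True, compiled=False); the import path below is taken from the traceback above, and the call shape mirrors the test body:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Values copied from one "Trying example" block in this log.
x = torch.randn([128, 2 * 5120], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :5120].contiguous()
x1 = x[:, 5120:].contiguous()
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # raises the fp8e4nv ValueError on SM < 8.9
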
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0917598Z 2025-05-07T20:32:22.0917726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0917731Z 2025-05-07T20:32:22.0917830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0918048Z self=, 2025-05-07T20:32:22.0918127Z T=1, 2025-05-07T20:32:22.0918201Z D=5120, 2025-05-07T20:32:22.0918278Z scale_ub=1200.0, 2025-05-07T20:32:22.0918364Z contiguous=True, 2025-05-07T20:32:22.0918444Z compiled=False, 2025-05-07T20:32:22.0918517Z ) 2025-05-07T20:32:22.0918737Z self = 2025-05-07T20:32:22.0918940Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0918945Z 2025-05-07T20:32:22.0919023Z @given( 2025-05-07T20:32:22.0919138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0919231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0919342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0919453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0919561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0919637Z ) 2025-05-07T20:32:22.0919875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0919963Z def test_silu_mul_quant( 2025-05-07T20:32:22.0920037Z self, 2025-05-07T20:32:22.0920111Z T: int, 2025-05-07T20:32:22.0920184Z D: int, 2025-05-07T20:32:22.0920286Z scale_ub: Optional[float], 2025-05-07T20:32:22.0920376Z contiguous: bool, 2025-05-07T20:32:22.0920463Z compiled: bool, 2025-05-07T20:32:22.0920538Z ) -> None: 2025-05-07T20:32:22.0920630Z torch.manual_seed(2025) 2025-05-07T20:32:22.0920706Z 2025-05-07T20:32:22.0920872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0920943Z 2025-05-07T20:32:22.0921037Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0921157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0921244Z x = x_sign * x_clamp 2025-05-07T20:32:22.0921321Z x0 = x[:, :D] 2025-05-07T20:32:22.0921394Z x1 = x[:, D:] 2025-05-07T20:32:22.0921464Z 2025-05-07T20:32:22.0921545Z if contiguous: 2025-05-07T20:32:22.0921633Z x0 = x0.contiguous() 2025-05-07T20:32:22.0921723Z x1 = x1.contiguous() 2025-05-07T20:32:22.0921790Z 2025-05-07T20:32:22.0921877Z if scale_ub is not None: 2025-05-07T20:32:22.0922031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0922175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0922249Z ) 2025-05-07T20:32:22.0922322Z else: 2025-05-07T20:32:22.0922412Z scale_ub_tensor = None 2025-05-07T20:32:22.0922481Z 2025-05-07T20:32:22.0922609Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0922696Z op = silu_mul_quant 2025-05-07T20:32:22.0922775Z if compiled: 2025-05-07T20:32:22.0922874Z op = torch.compile(op) 2025-05-07T20:32:22.0922975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0923052Z 2025-05-07T20:32:22.0923138Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0923142Z 2025-05-07T20:32:22.0923235Z moe/activation_test.py:117: 2025-05-07T20:32:22.0923477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0923617Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0923776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0924285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0924376Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0924738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0924957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0925297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0925394Z kernel = self.compile( 2025-05-07T20:32:22.0925779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0925953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0926081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0926129Z 2025-05-07T20:32:22.0926331Z self = 2025-05-07T20:32:22.0927098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0927590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f1647b00>} 2025-05-07T20:32:22.0928339Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0928526Z context = 2025-05-07T20:32:22.0928536Z 2025-05-07T20:32:22.0928699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0928969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0929070Z module_map=module_map) 2025-05-07T20:32:22.0929229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0929325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0929396Z E ^ 2025-05-07T20:32:22.0929752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0929756Z 2025-05-07T20:32:22.0930165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0930169Z 2025-05-07T20:32:22.0930267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0930530Z self=, 2025-05-07T20:32:22.0930612Z T=2048, 2025-05-07T20:32:22.0930688Z D=5120, 2025-05-07T20:32:22.0930765Z scale_ub=None, 2025-05-07T20:32:22.0930843Z contiguous=True, 2025-05-07T20:32:22.0930923Z compiled=False, 2025-05-07T20:32:22.0930991Z ) 2025-05-07T20:32:22.0931204Z self = 2025-05-07T20:32:22.0931378Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0931383Z 2025-05-07T20:32:22.0931453Z @given( 2025-05-07T20:32:22.0931568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0931664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0931773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0931887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0931994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0932103Z ) 2025-05-07T20:32:22.0932352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0932483Z def test_silu_mul_quant( 2025-05-07T20:32:22.0932555Z self, 2025-05-07T20:32:22.0932628Z T: int, 2025-05-07T20:32:22.0932699Z D: int, 2025-05-07T20:32:22.0932793Z scale_ub: Optional[float], 2025-05-07T20:32:22.0932879Z contiguous: bool, 2025-05-07T20:32:22.0932961Z compiled: bool, 2025-05-07T20:32:22.0933032Z ) -> None: 2025-05-07T20:32:22.0933126Z torch.manual_seed(2025) 2025-05-07T20:32:22.0933195Z 2025-05-07T20:32:22.0933360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0933434Z 2025-05-07T20:32:22.0933520Z > x_sign = torch.sign(x) 2025-05-07T20:32:22.0935308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0935358Z 2025-05-07T20:32:22.0935471Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:22.0935476Z 2025-05-07T20:32:22.0935576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0935794Z self=, 2025-05-07T20:32:22.0935867Z T=16384, 2025-05-07T20:32:22.0935943Z D=5120, 2025-05-07T20:32:22.0936018Z scale_ub=None, 2025-05-07T20:32:22.0936096Z contiguous=True, 2025-05-07T20:32:22.0936178Z compiled=False, 2025-05-07T20:32:22.0936246Z ) 2025-05-07T20:32:22.0936458Z self = 2025-05-07T20:32:22.0936642Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0936647Z 2025-05-07T20:32:22.0936721Z @given( 2025-05-07T20:32:22.0936838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0936930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0937037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0937151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0937258Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0937327Z ) 2025-05-07T20:32:22.0937568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0937657Z def test_silu_mul_quant( 2025-05-07T20:32:22.0937731Z self, 2025-05-07T20:32:22.0937803Z T: int, 2025-05-07T20:32:22.0937874Z D: int, 2025-05-07T20:32:22.0937970Z scale_ub: Optional[float], 2025-05-07T20:32:22.0938053Z contiguous: bool, 2025-05-07T20:32:22.0938180Z compiled: bool, 2025-05-07T20:32:22.0938256Z ) -> None: 2025-05-07T20:32:22.0938345Z torch.manual_seed(2025) 2025-05-07T20:32:22.0938601Z 2025-05-07T20:32:22.0938840Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0940656Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0940663Z 2025-05-07T20:32:22.0940868Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0941011Z 2025-05-07T20:32:22.0941114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0941336Z self=, 2025-05-07T20:32:22.0941409Z T=4096, 2025-05-07T20:32:22.0941481Z D=5120, 2025-05-07T20:32:22.0941564Z scale_ub=None, 2025-05-07T20:32:22.0941641Z contiguous=True, 2025-05-07T20:32:22.0941720Z compiled=False, 2025-05-07T20:32:22.0941789Z ) 2025-05-07T20:32:22.0941999Z self = 2025-05-07T20:32:22.0942164Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0942168Z 2025-05-07T20:32:22.0942244Z @given( 2025-05-07T20:32:22.0942354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0942446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0942561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0942675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0942855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0942924Z ) 2025-05-07T20:32:22.0943163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0943256Z def test_silu_mul_quant( 2025-05-07T20:32:22.0943328Z self, 2025-05-07T20:32:22.0943397Z T: int, 2025-05-07T20:32:22.0943475Z D: int, 2025-05-07T20:32:22.0943567Z scale_ub: Optional[float], 2025-05-07T20:32:22.0943654Z contiguous: bool, 2025-05-07T20:32:22.0943738Z compiled: bool, 2025-05-07T20:32:22.0943811Z ) -> None: 2025-05-07T20:32:22.0943900Z torch.manual_seed(2025) 2025-05-07T20:32:22.0943975Z 2025-05-07T20:32:22.0944138Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0945912Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
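
The "Tried to allocate" sizes line up exactly with the bfloat16 input that torch.randn builds at moe/activation_test.py:92: T x 2D elements at two bytes each. A quick check against three examples from this log:

def randn_bytes(T: int, D: int) -> int:
    # bfloat16 input of shape [T, 2 * D]: two bytes per element
    return T * (2 * D) * 2

assert randn_bytes(2048, 5120) == 40 * 1024**2    # "Tried to allocate 40.00 MiB"
assert randn_bytes(4096, 5120) == 80 * 1024**2    # "Tried to allocate 80.00 MiB"
assert randn_bytes(16384, 5120) == 320 * 1024**2  # "Tried to allocate 320.00 MiB"

So no single tensor here is large relative to the 22.07 GiB card; even 40 MiB requests fail only because roughly 21.7 GiB is already held when the example starts.
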
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0945923Z 2025-05-07T20:32:22.0946035Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0946040Z 2025-05-07T20:32:22.0946137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0946353Z self=, 2025-05-07T20:32:22.0946428Z T=2048, 2025-05-07T20:32:22.0946503Z D=5120, 2025-05-07T20:32:22.0946580Z scale_ub=None, 2025-05-07T20:32:22.0946663Z contiguous=False, 2025-05-07T20:32:22.0946747Z compiled=False, 2025-05-07T20:32:22.0946816Z ) 2025-05-07T20:32:22.0947091Z self = 2025-05-07T20:32:22.0947269Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.0947274Z 2025-05-07T20:32:22.0947346Z @given( 2025-05-07T20:32:22.0947461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0947557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0947665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0947777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0947884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0947953Z ) 2025-05-07T20:32:22.0948193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0948281Z def test_silu_mul_quant( 2025-05-07T20:32:22.0948356Z self, 2025-05-07T20:32:22.0948428Z T: int, 2025-05-07T20:32:22.0948543Z D: int, 2025-05-07T20:32:22.0948675Z scale_ub: Optional[float], 2025-05-07T20:32:22.0948761Z contiguous: bool, 2025-05-07T20:32:22.0948840Z compiled: bool, 2025-05-07T20:32:22.0948915Z ) -> None: 2025-05-07T20:32:22.0949006Z torch.manual_seed(2025) 2025-05-07T20:32:22.0949074Z 2025-05-07T20:32:22.0949243Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0951013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0951019Z 2025-05-07T20:32:22.0951142Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0951187Z 2025-05-07T20:32:22.0951285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0951503Z self=, 2025-05-07T20:32:22.0951576Z T=4096, 2025-05-07T20:32:22.0951648Z D=7168, 2025-05-07T20:32:22.0951732Z scale_ub=None, 2025-05-07T20:32:22.0951815Z contiguous=True, 2025-05-07T20:32:22.0951890Z compiled=True, 2025-05-07T20:32:22.0951962Z ) 2025-05-07T20:32:22.0952171Z self = 2025-05-07T20:32:22.0952335Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.0952339Z 2025-05-07T20:32:22.0952415Z @given( 2025-05-07T20:32:22.0952527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0952628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0952740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0952857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0952965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0953034Z ) 2025-05-07T20:32:22.0953271Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0953359Z def test_silu_mul_quant( 2025-05-07T20:32:22.0953435Z self, 2025-05-07T20:32:22.0953507Z T: int, 2025-05-07T20:32:22.0953582Z D: int, 2025-05-07T20:32:22.0953675Z scale_ub: Optional[float], 2025-05-07T20:32:22.0953757Z contiguous: bool, 2025-05-07T20:32:22.0953838Z compiled: bool, 2025-05-07T20:32:22.0953911Z ) -> None: 2025-05-07T20:32:22.0954000Z torch.manual_seed(2025) 2025-05-07T20:32:22.0954070Z 2025-05-07T20:32:22.0954234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0956055Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0956070Z 2025-05-07T20:32:22.0956180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0956185Z 2025-05-07T20:32:22.0956282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0956502Z self=, 2025-05-07T20:32:22.0956576Z T=2048, 2025-05-07T20:32:22.0956650Z D=5120, 2025-05-07T20:32:22.0956727Z scale_ub=1200.0, 2025-05-07T20:32:22.0956880Z contiguous=False, 2025-05-07T20:32:22.0957007Z compiled=False, 2025-05-07T20:32:22.0957074Z ) 2025-05-07T20:32:22.0957285Z self = 2025-05-07T20:32:22.0957460Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0957465Z 2025-05-07T20:32:22.0957537Z @given( 2025-05-07T20:32:22.0957646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0957739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0957846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0957958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0958064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0958134Z ) 2025-05-07T20:32:22.0958375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0958462Z def test_silu_mul_quant( 2025-05-07T20:32:22.0958535Z self, 2025-05-07T20:32:22.0958609Z T: int, 2025-05-07T20:32:22.0958730Z D: int, 2025-05-07T20:32:22.0958822Z scale_ub: Optional[float], 2025-05-07T20:32:22.0958906Z contiguous: bool, 2025-05-07T20:32:22.0958987Z compiled: bool, 2025-05-07T20:32:22.0959060Z ) -> None: 2025-05-07T20:32:22.0959151Z torch.manual_seed(2025) 2025-05-07T20:32:22.0959217Z 2025-05-07T20:32:22.0959381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0961150Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0961161Z 2025-05-07T20:32:22.0961278Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0961283Z 2025-05-07T20:32:22.0961378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0961593Z self=, 2025-05-07T20:32:22.0961671Z T=4096, 2025-05-07T20:32:22.0961742Z D=7168, 2025-05-07T20:32:22.0961816Z scale_ub=1200.0, 2025-05-07T20:32:22.0961897Z contiguous=True, 2025-05-07T20:32:22.0961972Z compiled=False, 2025-05-07T20:32:22.0962041Z ) 2025-05-07T20:32:22.0962257Z self = 2025-05-07T20:32:22.0962422Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0962427Z 2025-05-07T20:32:22.0962502Z @given( 2025-05-07T20:32:22.0962614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0962752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0962868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0962976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0963081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0963156Z ) 2025-05-07T20:32:22.0963474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0963560Z def test_silu_mul_quant( 2025-05-07T20:32:22.0963634Z self, 2025-05-07T20:32:22.0963706Z T: int, 2025-05-07T20:32:22.0963778Z D: int, 2025-05-07T20:32:22.0963870Z scale_ub: Optional[float], 2025-05-07T20:32:22.0963952Z contiguous: bool, 2025-05-07T20:32:22.0964036Z compiled: bool, 2025-05-07T20:32:22.0964108Z ) -> None: 2025-05-07T20:32:22.0964196Z torch.manual_seed(2025) 2025-05-07T20:32:22.0964268Z 2025-05-07T20:32:22.0964476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0966292Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0966303Z 2025-05-07T20:32:22.0966414Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0966419Z 2025-05-07T20:32:22.0966515Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0966733Z self=, 2025-05-07T20:32:22.0966807Z T=16384, 2025-05-07T20:32:22.0966884Z D=7168, 2025-05-07T20:32:22.0966961Z scale_ub=None, 2025-05-07T20:32:22.0967087Z contiguous=False, 2025-05-07T20:32:22.0967167Z compiled=True, 2025-05-07T20:32:22.0967238Z ) 2025-05-07T20:32:22.0967447Z self = 2025-05-07T20:32:22.0967621Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.0967626Z 2025-05-07T20:32:22.0967699Z @given( 2025-05-07T20:32:22.0967811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0967910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0968016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0968127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0968234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0968304Z ) 2025-05-07T20:32:22.0968547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0968635Z def test_silu_mul_quant( 2025-05-07T20:32:22.0968713Z self, 2025-05-07T20:32:22.0968790Z T: int, 2025-05-07T20:32:22.0968860Z D: int, 2025-05-07T20:32:22.0968951Z scale_ub: Optional[float], 2025-05-07T20:32:22.0969036Z contiguous: bool, 2025-05-07T20:32:22.0969116Z compiled: bool, 2025-05-07T20:32:22.0969188Z ) -> None: 2025-05-07T20:32:22.0969280Z torch.manual_seed(2025) 2025-05-07T20:32:22.0969349Z 2025-05-07T20:32:22.0969512Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0971332Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
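
Every example after the first OOM now fails at the very first allocation (activation_test.py:92), regardless of the contiguous and compiled parameters, with ~21.7 GiB still allocated. That is consistent with tensors from earlier examples never being released inside the long-lived Hypothesis process. A per-example cleanup along these lines (hypothetical, not present in the test file) would keep one failing example from starving the rest, e.g. called at the top of test_silu_mul_quant or from a fixture:

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references first, then hand cached blocks back
    # to the driver so the next example starts from a clean pool.
    gc.collect()
    torch.cuda.empty_cache()
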
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0971343Z 2025-05-07T20:32:22.0971458Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0971463Z 2025-05-07T20:32:22.0971559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0971776Z self=, 2025-05-07T20:32:22.0971854Z T=4096, 2025-05-07T20:32:22.0971925Z D=7168, 2025-05-07T20:32:22.0971999Z scale_ub=None, 2025-05-07T20:32:22.0972080Z contiguous=True, 2025-05-07T20:32:22.0972156Z compiled=False, 2025-05-07T20:32:22.0972227Z ) 2025-05-07T20:32:22.0972439Z self = 2025-05-07T20:32:22.0972602Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0972606Z 2025-05-07T20:32:22.0972721Z @given( 2025-05-07T20:32:22.0972835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0972967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0973076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0973185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0973292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0973364Z ) 2025-05-07T20:32:22.0973601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0973689Z def test_silu_mul_quant( 2025-05-07T20:32:22.0973764Z self, 2025-05-07T20:32:22.0973833Z T: int, 2025-05-07T20:32:22.0973905Z D: int, 2025-05-07T20:32:22.0973996Z scale_ub: Optional[float], 2025-05-07T20:32:22.0974077Z contiguous: bool, 2025-05-07T20:32:22.0974161Z compiled: bool, 2025-05-07T20:32:22.0974233Z ) -> None: 2025-05-07T20:32:22.0974321Z torch.manual_seed(2025) 2025-05-07T20:32:22.0974396Z 2025-05-07T20:32:22.0974562Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0976378Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
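For readers unfamiliar with the "Trying example:" lines: Hypothesis is re-running the @given-decorated test across the sampled parameter grid, printing each example because the test runs with verbosity=Verbosity.verbose. The session banner later in this log reports a 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True). A profile like that would be registered roughly as below; the actual definition lives in the test suite's configuration and is not shown in this log, so treat this as a sketch:

    from hypothesis import HealthCheck, settings

    # Hypothetical registration matching the profile banner printed by pytest.
    settings.register_profile(
        "ci",
        database=None,     # no example database on CI
        deadline=None,     # GPU compile times would trip any per-example deadline
        print_blob=True,
        derandomize=True,  # deterministic example order across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")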
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0976389Z 2025-05-07T20:32:22.0976499Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0976503Z 2025-05-07T20:32:22.0976601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0976823Z self=, 2025-05-07T20:32:22.0976898Z T=16384, 2025-05-07T20:32:22.0976980Z D=7168, 2025-05-07T20:32:22.0977059Z scale_ub=None, 2025-05-07T20:32:22.0977137Z contiguous=True, 2025-05-07T20:32:22.0977217Z compiled=False, 2025-05-07T20:32:22.0977286Z ) 2025-05-07T20:32:22.0977494Z self = 2025-05-07T20:32:22.0977666Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.0977670Z 2025-05-07T20:32:22.0977743Z @given( 2025-05-07T20:32:22.0977857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0977955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0978062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0978174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0978285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0978355Z ) 2025-05-07T20:32:22.0978643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0978738Z def test_silu_mul_quant( 2025-05-07T20:32:22.0978811Z self, 2025-05-07T20:32:22.0978889Z T: int, 2025-05-07T20:32:22.0978961Z D: int, 2025-05-07T20:32:22.0979055Z scale_ub: Optional[float], 2025-05-07T20:32:22.0979140Z contiguous: bool, 2025-05-07T20:32:22.0979217Z compiled: bool, 2025-05-07T20:32:22.0979290Z ) -> None: 2025-05-07T20:32:22.0979381Z torch.manual_seed(2025) 2025-05-07T20:32:22.0979450Z 2025-05-07T20:32:22.0979614Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0981486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0981530Z 2025-05-07T20:32:22.0981643Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0981647Z 2025-05-07T20:32:22.0981743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0981959Z self=, 2025-05-07T20:32:22.0982034Z T=16384, 2025-05-07T20:32:22.0982108Z D=7168, 2025-05-07T20:32:22.0982187Z scale_ub=1200.0, 2025-05-07T20:32:22.0982271Z contiguous=True, 2025-05-07T20:32:22.0982352Z compiled=False, 2025-05-07T20:32:22.0982420Z ) 2025-05-07T20:32:22.0982634Z self = 2025-05-07T20:32:22.0982805Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.0982810Z 2025-05-07T20:32:22.0982891Z @given( 2025-05-07T20:32:22.0983044Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0983137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0983246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0983355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0983465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0983537Z ) 2025-05-07T20:32:22.0983775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0983863Z def test_silu_mul_quant( 2025-05-07T20:32:22.0983938Z self, 2025-05-07T20:32:22.0984011Z T: int, 2025-05-07T20:32:22.0984084Z D: int, 2025-05-07T20:32:22.0984180Z scale_ub: Optional[float], 2025-05-07T20:32:22.0984263Z contiguous: bool, 2025-05-07T20:32:22.0984345Z compiled: bool, 2025-05-07T20:32:22.0984420Z ) -> None: 2025-05-07T20:32:22.0984513Z torch.manual_seed(2025) 2025-05-07T20:32:22.0984592Z 2025-05-07T20:32:22.0984756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0986526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
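Each of these failures ends with the allocator's own suggestion. PYTORCH_CUDA_ALLOC_CONF is read when CUDA is first initialized, so it has to be set before that point, either in the shell that launches pytest or at the very top of the process. A minimal sketch, assuming the variable is not already set by the CI job:

    import os

    # Must happen before torch initializes CUDA (ideally before importing torch).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402

    if torch.cuda.is_available():
        x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note that the messages also show 21.73 GiB genuinely allocated by PyTorch with only ~19 MiB reserved-but-unallocated, so fragmentation is at most a small part of the problem here.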
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.0986537Z 2025-05-07T20:32:22.0986648Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.0986652Z 2025-05-07T20:32:22.0986751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0987040Z self=, 2025-05-07T20:32:22.0987116Z T=128, 2025-05-07T20:32:22.0987193Z D=5120, 2025-05-07T20:32:22.0987269Z scale_ub=1200.0, 2025-05-07T20:32:22.0987348Z contiguous=False, 2025-05-07T20:32:22.0987428Z compiled=False, 2025-05-07T20:32:22.0987496Z ) 2025-05-07T20:32:22.0987706Z self = 2025-05-07T20:32:22.0987877Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.0987881Z 2025-05-07T20:32:22.0987949Z @given( 2025-05-07T20:32:22.0988059Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0988154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0988259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0988377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0988526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0988598Z ) 2025-05-07T20:32:22.0988880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0988971Z def test_silu_mul_quant( 2025-05-07T20:32:22.0989045Z self, 2025-05-07T20:32:22.0989118Z T: int, 2025-05-07T20:32:22.0989189Z D: int, 2025-05-07T20:32:22.0989279Z scale_ub: Optional[float], 2025-05-07T20:32:22.0989365Z contiguous: bool, 2025-05-07T20:32:22.0989444Z compiled: bool, 2025-05-07T20:32:22.0989515Z ) -> None: 2025-05-07T20:32:22.0989609Z torch.manual_seed(2025) 2025-05-07T20:32:22.0989673Z 2025-05-07T20:32:22.0989837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0989910Z 2025-05-07T20:32:22.0989997Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0990120Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0990203Z x = x_sign * x_clamp 2025-05-07T20:32:22.0990283Z x0 = x[:, :D] 2025-05-07T20:32:22.0990364Z x1 = x[:, D:] 2025-05-07T20:32:22.0990476Z 2025-05-07T20:32:22.0990554Z if contiguous: 2025-05-07T20:32:22.0990642Z x0 = x0.contiguous() 2025-05-07T20:32:22.0990728Z x1 = x1.contiguous() 2025-05-07T20:32:22.0990797Z 2025-05-07T20:32:22.0990887Z if scale_ub is not None: 2025-05-07T20:32:22.0990987Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0991141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0991218Z ) 2025-05-07T20:32:22.0991310Z else: 2025-05-07T20:32:22.0991404Z scale_ub_tensor = None 2025-05-07T20:32:22.0991472Z 2025-05-07T20:32:22.0991596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0991682Z op = silu_mul_quant 2025-05-07T20:32:22.0991761Z if compiled: 2025-05-07T20:32:22.0991852Z op = torch.compile(op) 2025-05-07T20:32:22.0991957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0992028Z 2025-05-07T20:32:22.0992117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0992121Z 2025-05-07T20:32:22.0992215Z moe/activation_test.py:117: 2025-05-07T20:32:22.0992340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0992438Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0992534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0993032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0993123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0993482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:22.0993699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0994091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0994185Z kernel = self.compile( 2025-05-07T20:32:22.0994570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0994740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0994861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0994866Z 2025-05-07T20:32:22.0995072Z self = 2025-05-07T20:32:22.0995841Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0996375Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146e700>} 2025-05-07T20:32:22.0997168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0997354Z context = 2025-05-07T20:32:22.0997362Z 2025-05-07T20:32:22.0997525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0997789Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0997892Z module_map=module_map) 2025-05-07T20:32:22.0998047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0998139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0998213Z E ^ 2025-05-07T20:32:22.0998568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0998618Z 2025-05-07T20:32:22.0999037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0999042Z 2025-05-07T20:32:22.0999142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0999362Z self=, 2025-05-07T20:32:22.0999434Z T=2048, 2025-05-07T20:32:22.0999502Z D=7168, 2025-05-07T20:32:22.0999577Z scale_ub=None, 2025-05-07T20:32:22.0999660Z contiguous=False, 2025-05-07T20:32:22.0999742Z compiled=False, 2025-05-07T20:32:22.0999811Z ) 2025-05-07T20:32:22.1000028Z self = 2025-05-07T20:32:22.1000197Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.1000202Z 2025-05-07T20:32:22.1000279Z @given( 2025-05-07T20:32:22.1000395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1000492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1000607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1000716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1000825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1000897Z ) 2025-05-07T20:32:22.1001136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1001223Z def test_silu_mul_quant( 2025-05-07T20:32:22.1001294Z self, 2025-05-07T20:32:22.1001364Z T: int, 2025-05-07T20:32:22.1001439Z D: int, 2025-05-07T20:32:22.1001534Z scale_ub: Optional[float], 2025-05-07T20:32:22.1001616Z contiguous: bool, 2025-05-07T20:32:22.1001706Z compiled: bool, 2025-05-07T20:32:22.1001780Z ) -> None: 2025-05-07T20:32:22.1001869Z torch.manual_seed(2025) 2025-05-07T20:32:22.1001939Z 2025-05-07T20:32:22.1002152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1004054Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
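The CompilationError above, unlike the OOMs, is an architecture limitation: Triton's fp8e4nv type (float8 e4m3) is rejected on this runner's GPU, and only fp8e4b15 and fp8e5 are offered. A g5 instance carries an A10G, i.e. compute capability 8.6, while fp8e4nv is generally understood to need capability 8.9 or newer (Ada/Hopper); that threshold is an assumption here, not something this log states. A hedged capability guard that would turn these failures into skips might look like:

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # Assumption: Triton's fp8e4nv needs compute capability >= 8.9.
        # On sm_86 the error lists only ('fp8e4b15', 'fp8e5') as supported.
        if not torch.cuda.is_available():
            pytest.skip("CUDA not available")
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) < (8, 9):
            pytest.skip(f"fp8e4nv unsupported on sm_{major}{minor}")

Calling require_fp8e4nv() at the top of test_silu_mul_quant (a hypothetical change, not what the suite does) would report the incompatibility as a skip instead of a red test.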
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1004060Z 2025-05-07T20:32:22.1004172Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.1004177Z 2025-05-07T20:32:22.1004275Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1004495Z self=, 2025-05-07T20:32:22.1004611Z T=128, 2025-05-07T20:32:22.1004687Z D=7168, 2025-05-07T20:32:22.1004805Z scale_ub=1200.0, 2025-05-07T20:32:22.1004883Z contiguous=True, 2025-05-07T20:32:22.1004963Z compiled=True, 2025-05-07T20:32:22.1005029Z ) 2025-05-07T20:32:22.1005240Z self = 2025-05-07T20:32:22.1005406Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.1005410Z 2025-05-07T20:32:22.1005481Z @given( 2025-05-07T20:32:22.1005592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1005686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1005792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1005903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1006010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1006081Z ) 2025-05-07T20:32:22.1006327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1006419Z def test_silu_mul_quant( 2025-05-07T20:32:22.1006535Z self, 2025-05-07T20:32:22.1006611Z T: int, 2025-05-07T20:32:22.1006684Z D: int, 2025-05-07T20:32:22.1006777Z scale_ub: Optional[float], 2025-05-07T20:32:22.1006862Z contiguous: bool, 2025-05-07T20:32:22.1006943Z compiled: bool, 2025-05-07T20:32:22.1007016Z ) -> None: 2025-05-07T20:32:22.1007109Z torch.manual_seed(2025) 2025-05-07T20:32:22.1007177Z 2025-05-07T20:32:22.1007344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1007414Z 2025-05-07T20:32:22.1007501Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1007625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1007709Z x = x_sign * x_clamp 2025-05-07T20:32:22.1007787Z x0 = x[:, :D] 2025-05-07T20:32:22.1007863Z x1 = x[:, D:] 2025-05-07T20:32:22.1007928Z 2025-05-07T20:32:22.1008009Z if contiguous: 2025-05-07T20:32:22.1008106Z x0 = x0.contiguous() 2025-05-07T20:32:22.1008193Z x1 = x1.contiguous() 2025-05-07T20:32:22.1008263Z 2025-05-07T20:32:22.1008352Z if scale_ub is not None: 2025-05-07T20:32:22.1008452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.1008583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.1008655Z ) 2025-05-07T20:32:22.1008727Z else: 2025-05-07T20:32:22.1008819Z scale_ub_tensor = None 2025-05-07T20:32:22.1008886Z 2025-05-07T20:32:22.1009010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.1009098Z op = silu_mul_quant 2025-05-07T20:32:22.1009177Z if compiled: 2025-05-07T20:32:22.1009270Z op = torch.compile(op) 2025-05-07T20:32:22.1009374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1009443Z 2025-05-07T20:32:22.1009529Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.1009533Z 2025-05-07T20:32:22.1009678Z moe/activation_test.py:117: 2025-05-07T20:32:22.1009807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1009907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.1013167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1013569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.1013664Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.1014167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.1014261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.1014620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.1014910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.1015256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.1015390Z kernel = self.compile( 2025-05-07T20:32:22.1015778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.1015949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.1016076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1016081Z 2025-05-07T20:32:22.1016284Z self = 2025-05-07T20:32:22.1017052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.1017556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16f146ff60>} 2025-05-07T20:32:22.1018372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.1018561Z context = 2025-05-07T20:32:22.1018566Z 2025-05-07T20:32:22.1018727Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.1018994Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.1019095Z module_map=module_map) 2025-05-07T20:32:22.1019255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.1019356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.1019428Z E ^ 2025-05-07T20:32:22.1019786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.1019795Z 2025-05-07T20:32:22.1020215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.1020219Z 2025-05-07T20:32:22.1020319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1020541Z self=, 2025-05-07T20:32:22.1020612Z T=128, 2025-05-07T20:32:22.1020687Z D=7168, 2025-05-07T20:32:22.1020770Z scale_ub=1200.0, 2025-05-07T20:32:22.1020850Z contiguous=True, 2025-05-07T20:32:22.1020930Z compiled=False, 2025-05-07T20:32:22.1021006Z ) 2025-05-07T20:32:22.1021218Z self = 2025-05-07T20:32:22.1021387Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:22.1021395Z 2025-05-07T20:32:22.1021473Z @given( 2025-05-07T20:32:22.1021636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1021737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1021846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1021958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1022069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1022142Z ) 2025-05-07T20:32:22.1022382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1022473Z def test_silu_mul_quant( 2025-05-07T20:32:22.1022547Z self, 2025-05-07T20:32:22.1022619Z T: int, 2025-05-07T20:32:22.1022697Z D: int, 2025-05-07T20:32:22.1022789Z scale_ub: Optional[float], 2025-05-07T20:32:22.1022876Z contiguous: bool, 2025-05-07T20:32:22.1022956Z compiled: bool, 2025-05-07T20:32:22.1023031Z ) -> None: 2025-05-07T20:32:22.1023169Z torch.manual_seed(2025) 2025-05-07T20:32:22.1023237Z 2025-05-07T20:32:22.1023447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1023524Z 2025-05-07T20:32:22.1023614Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1023735Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1025509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
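Worth noticing in the trace above: the OOM has moved from the initial torch.randn to the later torch.clamp, and the "allocated by PyTorch" figure has crept from 21.73 GiB to 21.77 GiB over successive examples, so memory from earlier Hypothesis examples is still alive when the next one starts. One mitigation (hypothetical placement; the test as shown does not do this) is to drop references and return cached blocks between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Collect dead Python references first, then hand cached, unused
        # allocator blocks back to the driver. Safe no-op without CUDA.
        gc.collect()
        torch.cuda.empty_cache()

Calling this at the start of each example trades speed for headroom; it does not help if live tensors are genuinely pinned by earlier state.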
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1025515Z 2025-05-07T20:32:22.1025630Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.1025635Z 2025-05-07T20:32:22.1025742Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1026004Z self=, 2025-05-07T20:32:22.1026082Z T=128, 2025-05-07T20:32:22.1026156Z D=5120, 2025-05-07T20:32:22.1026235Z scale_ub=1200.0, 2025-05-07T20:32:22.1026320Z contiguous=True, 2025-05-07T20:32:22.1026397Z compiled=True, 2025-05-07T20:32:22.1026467Z ) 2025-05-07T20:32:22.1026684Z self = 2025-05-07T20:32:22.1026850Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:22.1026854Z 2025-05-07T20:32:22.1026927Z @given( 2025-05-07T20:32:22.1027043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1027136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1027246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1027358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1027473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1027545Z ) 2025-05-07T20:32:22.1027784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1027872Z def test_silu_mul_quant( 2025-05-07T20:32:22.1027949Z self, 2025-05-07T20:32:22.1028023Z T: int, 2025-05-07T20:32:22.1028097Z D: int, 2025-05-07T20:32:22.1028192Z scale_ub: Optional[float], 2025-05-07T20:32:22.1028278Z contiguous: bool, 2025-05-07T20:32:22.1028359Z compiled: bool, 2025-05-07T20:32:22.1028436Z ) -> None: 2025-05-07T20:32:22.1028527Z torch.manual_seed(2025) 2025-05-07T20:32:22.1028596Z 2025-05-07T20:32:22.1028762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1028831Z 2025-05-07T20:32:22.1028921Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1029044Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1030856Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
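Later in this log the reference path fails the same way: triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which also requires fp8e4nv. The row-wise quantization itself is simple to state in plain PyTorch, which is useful as a mental model for what the Triton kernel computes. A sketch under stated assumptions (448.0 is the finite maximum of float8_e4m3fn; torch.float8_e4m3fn needs a recent PyTorch; this is not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to FP8_E4M3_MAX.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

Dequantization then matches the check in the test body, y_fp8.to(torch.float32) * y_scale[:, None].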
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1030869Z 2025-05-07T20:32:22.1030983Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:22.1030988Z 2025-05-07T20:32:22.1031086Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1031310Z self=, 2025-05-07T20:32:22.1031384Z T=128, 2025-05-07T20:32:22.1031492Z D=7168, 2025-05-07T20:32:22.1031579Z scale_ub=None, 2025-05-07T20:32:22.1031704Z contiguous=True, 2025-05-07T20:32:22.1031790Z compiled=True, 2025-05-07T20:32:22.1031862Z ) 2025-05-07T20:32:22.1032074Z self = 2025-05-07T20:32:22.1032240Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.1032245Z 2025-05-07T20:32:22.1032317Z @given( 2025-05-07T20:32:22.1032431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1032526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1032634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1032746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1032857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1032929Z ) 2025-05-07T20:32:22.1033170Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1033261Z def test_silu_mul_quant( 2025-05-07T20:32:22.1033338Z self, 2025-05-07T20:32:22.1033461Z T: int, 2025-05-07T20:32:22.1033533Z D: int, 2025-05-07T20:32:22.1033626Z scale_ub: Optional[float], 2025-05-07T20:32:22.1033715Z contiguous: bool, 2025-05-07T20:32:22.1033799Z compiled: bool, 2025-05-07T20:32:22.1033872Z ) -> None: 2025-05-07T20:32:22.1033963Z torch.manual_seed(2025) 2025-05-07T20:32:22.1034029Z 2025-05-07T20:32:22.1034194Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1035962Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:22.1035973Z 2025-05-07T20:32:22.1036086Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:22.1036217Z =============================== warnings summary =============================== 2025-05-07T20:32:22.1036520Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1036824Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1037118Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:22.1037985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:22.1038260Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:22.1038268Z 2025-05-07T20:32:22.1038761Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:22.1038995Z ================= 1 failed, 1 deselected, 3 warnings in 14.16s ================= 2025-05-07T20:32:23.7176622Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:23.7827812Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:23.7828159Z 2025-05-07T20:32:25.7848617Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:27.9536093Z ============================= test session starts ============================== 2025-05-07T20:32:27.9536834Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:27.9537354Z cachedir: .pytest_cache 2025-05-07T20:32:27.9537924Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:27.9539012Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:27.9539426Z plugins: hypothesis-6.131.14 2025-05-07T20:32:29.5749891Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:29.7265700Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:29.7266510Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:29.7266939Z 2025-05-07T20:32:32.1390891Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1392232Z self=, 2025-05-07T20:32:32.1393461Z T=1, 2025-05-07T20:32:32.1393826Z D=5120, 2025-05-07T20:32:32.1394203Z scale_ub=None, 2025-05-07T20:32:32.1394503Z contiguous=True, 2025-05-07T20:32:32.1394724Z compiled=True, 2025-05-07T20:32:32.1394928Z ) 2025-05-07T20:32:32.1395253Z self = 2025-05-07T20:32:32.1395747Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.1396014Z 2025-05-07T20:32:32.1396093Z @given( 2025-05-07T20:32:32.1396326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.1396644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.1396948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.1397284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.1397619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.1397898Z ) 2025-05-07T20:32:32.1398258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.1398707Z def test_silu_mul_quant( 2025-05-07T20:32:32.1398955Z self, 2025-05-07T20:32:32.1399145Z T: int, 2025-05-07T20:32:32.1399349Z D: int, 2025-05-07T20:32:32.1399572Z scale_ub: Optional[float], 2025-05-07T20:32:32.1399842Z contiguous: bool, 2025-05-07T20:32:32.1400082Z compiled: bool, 2025-05-07T20:32:32.1400311Z ) -> None: 2025-05-07T20:32:32.1400523Z torch.manual_seed(2025) 2025-05-07T20:32:32.1400766Z 2025-05-07T20:32:32.1401046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.1401384Z 2025-05-07T20:32:32.1401585Z x_sign = torch.sign(x) 2025-05-07T20:32:32.1401879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:32.1402186Z x = x_sign * x_clamp 2025-05-07T20:32:32.1402432Z x0 = x[:, :D] 2025-05-07T20:32:32.1402652Z x1 = x[:, D:] 2025-05-07T20:32:32.1402956Z 2025-05-07T20:32:32.1403149Z if contiguous: 2025-05-07T20:32:32.1403388Z x0 = x0.contiguous() 2025-05-07T20:32:32.1403769Z x1 = x1.contiguous() 2025-05-07T20:32:32.1404024Z 2025-05-07T20:32:32.1404221Z if scale_ub is not None: 2025-05-07T20:32:32.1404498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.1404831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.1405147Z ) 2025-05-07T20:32:32.1405341Z else: 2025-05-07T20:32:32.1405547Z scale_ub_tensor = None 2025-05-07T20:32:32.1405805Z 2025-05-07T20:32:32.1406044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1406354Z op = silu_mul_quant 2025-05-07T20:32:32.1406610Z if compiled: 2025-05-07T20:32:32.1406861Z op = torch.compile(op) 2025-05-07T20:32:32.1407246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.1407620Z 2025-05-07T20:32:32.1407818Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.1408100Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.1408394Z 2025-05-07T20:32:32.1408636Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1408972Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.1409270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.1409585Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.1409947Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1410255Z 2025-05-07T20:32:32.1410460Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.1410653Z 2025-05-07T20:32:32.1410763Z moe/activation_test.py:126: 2025-05-07T20:32:32.1411055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1411398Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.1411732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1412583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.1413345Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.1413895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.1414585Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.1415273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.1416005Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1416767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:32.1417523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1418258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.1418903Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.1419525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.1420053Z fn() 2025-05-07T20:32:32.1420561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.1421153Z self.fn.run( 
2025-05-07T20:32:32.1421624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.1422155Z kernel = self.compile( 2025-05-07T20:32:32.1422708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.1423425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1423830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1424059Z 2025-05-07T20:32:32.1424268Z self = 2025-05-07T20:32:32.1425353Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.1426744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab33d3a0>} 2025-05-07T20:32:32.1428137Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.1429209Z context = 2025-05-07T20:32:32.1429502Z 2025-05-07T20:32:32.1429671Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.1430198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1430676Z module_map=module_map) 2025-05-07T20:32:32.1431045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1431405Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.1431674Z E ^ 2025-05-07T20:32:32.1432133Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1432592Z 2025-05-07T20:32:32.1433014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.1433540Z 2025-05-07T20:32:32.1433651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1434109Z self=, 2025-05-07T20:32:32.1434505Z T=2048, 2025-05-07T20:32:32.1434696Z D=5120, 2025-05-07T20:32:32.1434886Z scale_ub=1200.0, 2025-05-07T20:32:32.1435104Z contiguous=True, 2025-05-07T20:32:32.1435327Z compiled=False, 2025-05-07T20:32:32.1435541Z ) 2025-05-07T20:32:33.0991594Z self = 2025-05-07T20:32:33.0993240Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:33.0994009Z 2025-05-07T20:32:33.0994248Z @given( 2025-05-07T20:32:33.0994633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.0994995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.0995308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.0995658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.0996009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.0996304Z ) 2025-05-07T20:32:33.0996662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.0997105Z def test_silu_mul_quant( 2025-05-07T20:32:33.0997350Z self, 2025-05-07T20:32:33.0997551Z T: int, 2025-05-07T20:32:33.0997744Z D: int, 2025-05-07T20:32:33.0997968Z scale_ub: Optional[float], 2025-05-07T20:32:33.0998246Z contiguous: bool, 2025-05-07T20:32:33.0998484Z compiled: bool, 2025-05-07T20:32:33.0998719Z ) -> None: 2025-05-07T20:32:33.0998936Z torch.manual_seed(2025) 2025-05-07T20:32:33.0999176Z 2025-05-07T20:32:33.0999458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.0999809Z 
2025-05-07T20:32:33.1000000Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1000303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1000939Z x = x_sign * x_clamp 2025-05-07T20:32:33.1001200Z x0 = x[:, :D] 2025-05-07T20:32:33.1001417Z x1 = x[:, D:] 2025-05-07T20:32:33.1001629Z 2025-05-07T20:32:33.1001824Z if contiguous: 2025-05-07T20:32:33.1002055Z x0 = x0.contiguous() 2025-05-07T20:32:33.1002318Z x1 = x1.contiguous() 2025-05-07T20:32:33.1002565Z 2025-05-07T20:32:33.1002756Z if scale_ub is not None: 2025-05-07T20:32:33.1003037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1003382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1003847Z ) 2025-05-07T20:32:33.1004045Z else: 2025-05-07T20:32:33.1004262Z scale_ub_tensor = None 2025-05-07T20:32:33.1004514Z 2025-05-07T20:32:33.1004748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1005063Z op = silu_mul_quant 2025-05-07T20:32:33.1005410Z if compiled: 2025-05-07T20:32:33.1005667Z op = torch.compile(op) 2025-05-07T20:32:33.1006050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1006324Z 2025-05-07T20:32:33.1006521Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1006693Z 2025-05-07T20:32:33.1006796Z moe/activation_test.py:117: 2025-05-07T20:32:33.1007094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1007426Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1007713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1008415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.1009113Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.1009659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1010358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1011128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1011667Z kernel = self.compile( 2025-05-07T20:32:33.1012221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1012886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1013287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1013518Z 2025-05-07T20:32:33.1013728Z self = 2025-05-07T20:32:33.1014867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.1016261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fb3ab1ec2c0>} 2025-05-07T20:32:33.1017618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1018651Z context = 2025-05-07T20:32:33.1018950Z 2025-05-07T20:32:33.1019119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1019649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1020130Z module_map=module_map) 2025-05-07T20:32:33.1020493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1020854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.1021125Z E ^ 2025-05-07T20:32:33.1021645Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1022116Z 2025-05-07T20:32:33.1022539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1023067Z 2025-05-07T20:32:33.1023175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1023593Z self=, 2025-05-07T20:32:33.1024001Z T=2048, 2025-05-07T20:32:33.1024197Z D=5120, 2025-05-07T20:32:33.1024398Z scale_ub=1200.0, 2025-05-07T20:32:33.1024624Z contiguous=True, 2025-05-07T20:32:33.1024855Z compiled=True, 2025-05-07T20:32:33.1025071Z ) 2025-05-07T20:32:33.1025392Z self = 2025-05-07T20:32:33.1025935Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.1026219Z 2025-05-07T20:32:33.1026340Z @given( 2025-05-07T20:32:33.1026580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1026892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1027202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1027538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1027864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1028158Z ) 2025-05-07T20:32:33.1028513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1028955Z def test_silu_mul_quant( 2025-05-07T20:32:33.1029202Z self, 2025-05-07T20:32:33.1029402Z T: int, 2025-05-07T20:32:33.1029595Z D: int, 2025-05-07T20:32:33.1029818Z scale_ub: Optional[float], 2025-05-07T20:32:33.1030093Z contiguous: bool, 2025-05-07T20:32:33.1030340Z compiled: bool, 2025-05-07T20:32:33.1030561Z ) -> None: 2025-05-07T20:32:33.1030785Z torch.manual_seed(2025) 2025-05-07T20:32:33.1031089Z 2025-05-07T20:32:33.1031362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1031708Z 2025-05-07T20:32:33.1031906Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1032194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1032508Z x = x_sign * x_clamp 2025-05-07T20:32:33.1032756Z x0 = x[:, :D] 2025-05-07T20:32:33.1032970Z x1 = x[:, D:] 2025-05-07T20:32:33.1033187Z 2025-05-07T20:32:33.1033379Z if contiguous: 2025-05-07T20:32:33.1033610Z x0 = x0.contiguous() 2025-05-07T20:32:33.1033875Z x1 = x1.contiguous() 2025-05-07T20:32:33.1034126Z 2025-05-07T20:32:33.1034316Z if scale_ub is not None: 2025-05-07T20:32:33.1034596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1034938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1035260Z ) 2025-05-07T20:32:33.1035460Z else: 2025-05-07T20:32:33.1035683Z scale_ub_tensor = None 2025-05-07T20:32:33.1035944Z 2025-05-07T20:32:33.1036173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1036497Z op = silu_mul_quant 2025-05-07T20:32:33.1036754Z if compiled: 
2025-05-07T20:32:33.1036999Z op = torch.compile(op) 2025-05-07T20:32:33.1037305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1037591Z 2025-05-07T20:32:33.1037782Z y_fp8, y_scale = fn() 2025-05-07T20:32:33.1038073Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:33.1038697Z 2025-05-07T20:32:33.1038944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1039289Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:33.1039589Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:33.1039909Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:33.1040345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.1040673Z 2025-05-07T20:32:33.1040880Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:33.1041076Z 2025-05-07T20:32:33.1041176Z moe/activation_test.py:126: 2025-05-07T20:32:33.1041475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1041814Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:33.1042138Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.1042934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:33.1043814Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:33.1044389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1045180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1045938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:33.1046679Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.1047445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:33.1048195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.1048938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:33.1049590Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:33.1050199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:33.1050734Z fn() 2025-05-07T20:32:33.1051259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:33.1051918Z self.fn.run( 2025-05-07T20:32:33.1052389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1059714Z kernel = self.compile( 2025-05-07T20:32:33.1060318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1060997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1061399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1061640Z 2025-05-07T20:32:33.1061850Z self = 2025-05-07T20:32:33.1062949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:33.1064334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3aa0eb880>} 2025-05-07T20:32:33.1065678Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1066716Z context = 2025-05-07T20:32:33.1067017Z 2025-05-07T20:32:33.1067189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1067724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1068188Z module_map=module_map) 2025-05-07T20:32:33.1068565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1069002Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.1069273Z E ^ 2025-05-07T20:32:33.1069745Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1070206Z 2025-05-07T20:32:33.1070628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1071142Z 2025-05-07T20:32:33.1071252Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1071662Z self=, 2025-05-07T20:32:33.1072070Z T=16384, 2025-05-07T20:32:33.1072270Z D=7168, 2025-05-07T20:32:33.1072464Z scale_ub=1200.0, 2025-05-07T20:32:33.1072693Z contiguous=False, 2025-05-07T20:32:33.1072924Z compiled=False, 2025-05-07T20:32:33.1073132Z ) 2025-05-07T20:32:33.9301293Z self = 2025-05-07T20:32:33.9302195Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.9302594Z 2025-05-07T20:32:33.9302697Z @given( 2025-05-07T20:32:33.9303006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9303420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9303788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9304133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9304471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9304770Z ) 2025-05-07T20:32:33.9305124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9305580Z def test_silu_mul_quant( 2025-05-07T20:32:33.9305833Z self, 2025-05-07T20:32:33.9306032Z T: int, 2025-05-07T20:32:33.9306242Z D: int, 2025-05-07T20:32:33.9306472Z scale_ub: Optional[float], 2025-05-07T20:32:33.9306751Z contiguous: bool, 2025-05-07T20:32:33.9307008Z compiled: bool, 2025-05-07T20:32:33.9307356Z ) -> None: 2025-05-07T20:32:33.9307574Z torch.manual_seed(2025) 2025-05-07T20:32:33.9307827Z 2025-05-07T20:32:33.9308119Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9308462Z 2025-05-07T20:32:33.9308667Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9308967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9309284Z x = x_sign * x_clamp 2025-05-07T20:32:33.9309525Z x0 = x[:, :D] 2025-05-07T20:32:33.9309753Z x1 = x[:, D:] 2025-05-07T20:32:33.9309998Z 2025-05-07T20:32:33.9310193Z if contiguous: 2025-05-07T20:32:33.9310429Z x0 = x0.contiguous() 2025-05-07T20:32:33.9310697Z x1 = x1.contiguous() 2025-05-07T20:32:33.9310950Z 2025-05-07T20:32:33.9311155Z if scale_ub is not None: 2025-05-07T20:32:33.9311442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9311788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9312113Z ) 2025-05-07T20:32:33.9312306Z else: 2025-05-07T20:32:33.9312528Z scale_ub_tensor = None 2025-05-07T20:32:33.9312796Z 2025-05-07T20:32:33.9313027Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:33.9313351Z op = silu_mul_quant 2025-05-07T20:32:33.9313612Z if compiled: 2025-05-07T20:32:33.9313859Z op = torch.compile(op) 2025-05-07T20:32:33.9314163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9314448Z 2025-05-07T20:32:33.9314648Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9314828Z 2025-05-07T20:32:33.9314933Z moe/activation_test.py:117: 2025-05-07T20:32:33.9315230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9315563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9315858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9316661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9317370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9317917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9318616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9319298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9319837Z kernel = self.compile( 2025-05-07T20:32:33.9320393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9321059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9321510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9321781Z 2025-05-07T20:32:33.9321994Z self = 2025-05-07T20:32:33.9323079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9324605Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a9e23380>} 2025-05-07T20:32:33.9326007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9327039Z context = 2025-05-07T20:32:33.9327330Z 2025-05-07T20:32:33.9327504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9328082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9328556Z module_map=module_map) 2025-05-07T20:32:33.9328920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9329282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9329549Z E ^ 2025-05-07T20:32:33.9330010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.9330472Z 
2025-05-07T20:32:33.9330890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.9331413Z 
2025-05-07T20:32:33.9331519Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:33.9334174Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:33.9334440Z 
2025-05-07T20:32:33.9334518Z     @given(
2025-05-07T20:32:33.9334750Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:33.9335065Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:33.9335372Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:33.9335704Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:33.9336032Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:33.9336325Z     )
2025-05-07T20:32:33.9336682Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:33.9337181Z     def test_silu_mul_quant(
2025-05-07T20:32:33.9337421Z         self,
2025-05-07T20:32:33.9337622Z         T: int,
2025-05-07T20:32:33.9337823Z         D: int,
2025-05-07T20:32:33.9338038Z         scale_ub: Optional[float],
2025-05-07T20:32:33.9338320Z         contiguous: bool,
2025-05-07T20:32:33.9338854Z         compiled: bool,
2025-05-07T20:32:33.9339076Z     ) -> None:
2025-05-07T20:32:33.9339297Z         torch.manual_seed(2025)
2025-05-07T20:32:33.9339544Z 
2025-05-07T20:32:33.9339816Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:33.9340165Z 
2025-05-07T20:32:33.9340363Z         x_sign = torch.sign(x)
2025-05-07T20:32:33.9340647Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:33.9340958Z         x = x_sign * x_clamp
2025-05-07T20:32:33.9341198Z         x0 = x[:, :D]
2025-05-07T20:32:33.9341410Z         x1 = x[:, D:]
2025-05-07T20:32:33.9341693Z 
2025-05-07T20:32:33.9341883Z         if contiguous:
2025-05-07T20:32:33.9342176Z             x0 = x0.contiguous()
2025-05-07T20:32:33.9342440Z             x1 = x1.contiguous()
2025-05-07T20:32:33.9342679Z 
2025-05-07T20:32:33.9342871Z         if scale_ub is not None:
2025-05-07T20:32:33.9343146Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:33.9343488Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:33.9343806Z             )
2025-05-07T20:32:33.9343997Z         else:
2025-05-07T20:32:33.9344209Z             scale_ub_tensor = None
2025-05-07T20:32:33.9344465Z 
2025-05-07T20:32:33.9344696Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.9345015Z             op = silu_mul_quant
2025-05-07T20:32:33.9345270Z             if compiled:
2025-05-07T20:32:33.9345510Z                 op = torch.compile(op)
2025-05-07T20:32:33.9345814Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.9346092Z 
2025-05-07T20:32:33.9346285Z         y_fp8, y_scale = fn()
2025-05-07T20:32:33.9346583Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:33.9346953Z 
2025-05-07T20:32:33.9347188Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.9347531Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:33.9347830Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:33.9348153Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:33.9348513Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:33.9348827Z 
2025-05-07T20:32:33.9349035Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:33.9349229Z 
2025-05-07T20:32:33.9349330Z moe/activation_test.py:126: 
2025-05-07T20:32:33.9349631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.9349970Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:33.9350299Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:33.9351101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:33.9351867Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:33.9352424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:33.9353107Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:33.9353805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:33.9354547Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:33.9355362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:33.9356113Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:33.9356951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:33.9357606Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:33.9358214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:33.9358731Z     fn()
2025-05-07T20:32:33.9359244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:33.9359833Z     self.fn.run(
2025-05-07T20:32:33.9360297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:33.9360832Z     kernel = self.compile(
2025-05-07T20:32:33.9361380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:33.9362089Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:33.9362542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.9368506Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:33.9369028Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:33.9369502Z                            module_map=module_map)
2025-05-07T20:32:33.9369870Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.9370233Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:33.9370500Z E       ^
2025-05-07T20:32:33.9370971Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.9371423Z 
2025-05-07T20:32:33.9371858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
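Note: the ValueError above comes from Triton's NVIDIA backend. fp8e4nv (float8_e4m3fn) codegen is only available on newer GPU architectures, and the dtypes the error lists (fp8e4b15, fp8e5) are what this runner's GPU does support. A minimal guard along these lines, assuming the requirement is compute capability 8.9 or newer (the helper and decorator names below are illustrative, not FBGEMM's API), would skip rather than fail these tests on such machines:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs an SM 8.9+ GPU (Ada/Hopper);
        # an A10G (SM 8.6) reports capability (8, 6) and would be skipped.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests that compile fp8e4nv Triton kernels.
    skip_if_no_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )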
2025-05-07T20:32:33.9372490Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.8721440Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.8721717Z moe/activation_test.py:117: 
2025-05-07T20:32:34.8735526Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.8735888Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.8736156Z E       ^
2025-05-07T20:32:34.8736669Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.8737600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.8738230Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.8753182Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.8753455Z moe/activation_test.py:117: 
2025-05-07T20:32:34.8767246Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.8767605Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.8767869Z E       ^
2025-05-07T20:32:34.8768329Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.8769213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.8769843Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:34.9242512Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:34.9242823Z moe/activation_test.py:126: 
2025-05-07T20:32:34.9263493Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.9263853Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:34.9264114Z E       ^
2025-05-07T20:32:34.9264582Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.9265461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
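The reference path that keeps failing here (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) is itself a Triton kernel, so it trips over the same fp8e4nv limitation as the kernel under test. A rough pure-PyTorch sketch of rowwise fp8 quantization (an assumption about what such a routine computes, not FBGEMM's implementation) shows that a reference of this shape does not strictly need Triton:

    import torch

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Rowwise absmax scaling into float8_e4m3fn; illustrative only.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)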
2025-05-07T20:32:34.9266087Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.2452496Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.2452768Z moe/activation_test.py:117: 
2025-05-07T20:32:35.2466481Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2466835Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.2467102Z E       ^
2025-05-07T20:32:35.2467573Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2468448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2469127Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.2483673Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.2483943Z moe/activation_test.py:117: 
2025-05-07T20:32:35.2497551Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2497908Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.2498173Z E       ^
2025-05-07T20:32:35.2498713Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2499590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2500216Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7031616Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7031924Z moe/activation_test.py:126: 
2025-05-07T20:32:35.7053038Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7053479Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7053754Z E       ^
2025-05-07T20:32:35.7054228Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7055112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
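Several of the failing examples use contiguous=False, which also matters for the recompile warning at the end of this log: slicing x[:, :D] out of a [T, 2*D] buffer yields a view whose row stride is still 2*D, not D. A small self-contained illustration (plain PyTorch, independent of the test above):

    import torch

    x = torch.randn(4, 2 * 8)        # [T, 2*D] buffer with D = 8
    x0 = x[:, :8]                    # view into the left half
    print(x0.is_contiguous())        # False: row stride is still 16 (= 2*D)
    print(x0.stride())               # (16, 1)
    print(x0.contiguous().stride())  # (8, 1) after copying into a dense tensor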
2025-05-07T20:32:35.7055795Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.1473775Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1474073Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1494607Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1494970Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1495233Z E       ^
2025-05-07T20:32:36.1495694Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1496568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1497185Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.8371411Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.8371717Z moe/activation_test.py:126: 
2025-05-07T20:32:36.8392071Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.8392443Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.8392714Z E       ^
2025-05-07T20:32:36.8393182Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.8394061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.8394683Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:37.3604159Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.3604458Z moe/activation_test.py:126: 
2025-05-07T20:32:37.3624871Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.3625236Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.3625503Z E       ^
2025-05-07T20:32:37.3625977Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.3626860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
= None 2025-05-07T20:32:37.3599284Z 2025-05-07T20:32:37.3599524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3599847Z op = silu_mul_quant 2025-05-07T20:32:37.3600108Z if compiled: 2025-05-07T20:32:37.3600350Z op = torch.compile(op) 2025-05-07T20:32:37.3600651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3600942Z 2025-05-07T20:32:37.3601137Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.3601439Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.3601738Z 2025-05-07T20:32:37.3602105Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3602448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.3602746Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.3603064Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.3603417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.3603954Z 2025-05-07T20:32:37.3604159Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.3604354Z 2025-05-07T20:32:37.3604458Z moe/activation_test.py:126: 2025-05-07T20:32:37.3604759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3605099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.3605423Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.3606327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.3607187Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.3607738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.3608420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.3609122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.3609863Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.3610626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.3611379Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.3612121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.3612816Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.3613422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.3613941Z fn() 2025-05-07T20:32:37.3614460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.3615052Z self.fn.run( 2025-05-07T20:32:37.3615522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.3616057Z kernel = self.compile( 2025-05-07T20:32:37.3616606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.3617260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3617661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3617904Z 2025-05-07T20:32:37.3618117Z self = 2025-05-07T20:32:37.3619205Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.3620601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a887e700>} 2025-05-07T20:32:37.3621951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.3622987Z context = 2025-05-07T20:32:37.3623286Z 2025-05-07T20:32:37.3623498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.3624033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3624503Z module_map=module_map) 2025-05-07T20:32:37.3624871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3625236Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.3625503Z E ^ 2025-05-07T20:32:37.3625977Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.3626438Z 2025-05-07T20:32:37.3626860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.3627376Z 2025-05-07T20:32:37.3627491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.3627949Z self=, 2025-05-07T20:32:37.3628394Z T=16384, 2025-05-07T20:32:37.3628603Z D=5120, 2025-05-07T20:32:37.3628798Z scale_ub=None, 2025-05-07T20:32:37.3629016Z contiguous=True, 2025-05-07T20:32:37.3629249Z compiled=True, 2025-05-07T20:32:37.3629454Z ) 2025-05-07T20:32:37.3893399Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:37.3894940Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:37.3896339Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:37.3897343Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:37.3898633Z W0507 20:32:37.388000 88291 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
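Every drawn example dies in the same place: Triton's front end rejects the fp8e4nv element type while lowering _kernel_quantize_fp8_row (and, below, _fbgemm_silu_mul_quant), reporting that this architecture only implements fp8e4b15 and fp8e5. A minimal sketch of an up-front capability probe, assuming Triton's usual rule that fp8e4nv (float8_e4m3fn) needs compute capability 8.9 or newer; supports_fp8e4nv is a hypothetical helper, not an FBGEMM or Triton API:

import torch

def supports_fp8e4nv() -> bool:
    """Best-effort probe for fp8e4nv support (sketch, not authoritative)."""
    if not torch.cuda.is_available():
        return False
    # fp8e4nv Triton kernels are generally limited to SM 8.9+ (Ada/Hopper);
    # older GPUs expose only the fp8e4b15/fp8e5 encodings, which is exactly
    # what the ValueError in this log reports.
    return torch.cuda.get_device_capability() >= (8, 9)

Used as a guard, a probe like this would turn the unconditional failures below into clean skips on pre-8.9 GPUs.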
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8670ae0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
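The sweep keeps drawing fresh examples and every one fails identically, so nothing new is learned after the first report. A sketch of how the suite could short-circuit the whole sweep instead, assuming the standard unittest pattern; the class name ActivationTests is a guess for the class in moe/activation_test.py, not taken from the log:

import unittest

import torch


class ActivationTests(unittest.TestCase):  # hypothetical name
    def setUp(self) -> None:
        # Skip every fp8 test up front on GPUs that cannot compile fp8e4nv,
        # rather than failing once per hypothesis example.
        if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9):
            self.skipTest("fp8e4nv needs SM 8.9+; this GPU only has fp8e4b15/fp8e5")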
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
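Independent of the fp8 failures, the recompile warning earlier in the log shows torch.compile re-specializing silu_mul_quant once per distinct shape/stride combination until config.recompile_limit (8) is exhausted: the hypothesis sweep varies T, and flipping contiguous changes x0's stride at dim 0 from 10240 to 5120. A sketch of two standard dynamo-side mitigations; the shape values are taken from the log, and neither line appears in the test as written:

import torch

# Option 1: mark the token dimension dynamic so a new T does not force a
# fresh specialization of the compiled op.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)

# Option 2: give shape-sweep tests more headroom than the default of 8
# (the limit named in the W0507 warning above).
torch._dynamo.config.recompile_limit = 64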
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.0243500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.0244311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.0244989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.0245525Z kernel = self.compile( 2025-05-07T20:32:38.0246663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.0247345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.0247748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.0247977Z 2025-05-07T20:32:38.0248186Z self = 2025-05-07T20:32:38.0249260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.0250624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785c720>} 2025-05-07T20:32:38.0252039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.0253128Z context = 2025-05-07T20:32:38.0253416Z 2025-05-07T20:32:38.0253586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.0254115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.0254587Z module_map=module_map) 2025-05-07T20:32:38.0254951Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.0255316Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.0255579Z E ^ 2025-05-07T20:32:38.0256054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.0256506Z 2025-05-07T20:32:38.0256932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.0257526Z 2025-05-07T20:32:38.0257631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.0258051Z self=, 2025-05-07T20:32:38.0258449Z T=128, 2025-05-07T20:32:38.0258642Z D=5120, 2025-05-07T20:32:38.0258841Z scale_ub=1200.0, 2025-05-07T20:32:38.0259069Z contiguous=True, 2025-05-07T20:32:38.0259294Z compiled=False, 2025-05-07T20:32:38.0259502Z ) 2025-05-07T20:32:38.3387855Z self = 2025-05-07T20:32:38.3388613Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:38.3388986Z 2025-05-07T20:32:38.3389105Z @given( 2025-05-07T20:32:38.3389344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.3389685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.3390002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.3390342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.3390676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.3390964Z ) 2025-05-07T20:32:38.3391327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.3391774Z def test_silu_mul_quant( 2025-05-07T20:32:38.3392020Z self, 2025-05-07T20:32:38.3392216Z T: int, 2025-05-07T20:32:38.3392421Z D: int, 2025-05-07T20:32:38.3392646Z scale_ub: Optional[float], 2025-05-07T20:32:38.3392915Z contiguous: bool, 2025-05-07T20:32:38.3393163Z compiled: bool, 2025-05-07T20:32:38.3393397Z ) -> None: 2025-05-07T20:32:38.3393613Z torch.manual_seed(2025) 2025-05-07T20:32:38.3393863Z 2025-05-07T20:32:38.3394146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.3394504Z 2025-05-07T20:32:38.3394697Z x_sign = torch.sign(x) 2025-05-07T20:32:38.3395321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.3395641Z x = x_sign * x_clamp 2025-05-07T20:32:38.3395879Z x0 = x[:, :D] 2025-05-07T20:32:38.3396100Z x1 = x[:, D:] 2025-05-07T20:32:38.3396310Z 2025-05-07T20:32:38.3396497Z if contiguous: 2025-05-07T20:32:38.3396735Z x0 = x0.contiguous() 2025-05-07T20:32:38.3396995Z x1 = x1.contiguous() 2025-05-07T20:32:38.3397237Z 2025-05-07T20:32:38.3397433Z if scale_ub is not None: 2025-05-07T20:32:38.3397709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.3398039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.3398348Z ) 2025-05-07T20:32:38.3398548Z else: 2025-05-07T20:32:38.3398756Z scale_ub_tensor = None 2025-05-07T20:32:38.3399014Z 2025-05-07T20:32:38.3399347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.3399664Z op = silu_mul_quant 2025-05-07T20:32:38.3400009Z if compiled: 2025-05-07T20:32:38.3400260Z op = torch.compile(op) 2025-05-07T20:32:38.3400562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3400835Z 2025-05-07T20:32:38.3401031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.3401195Z 2025-05-07T20:32:38.3401300Z moe/activation_test.py:117: 2025-05-07T20:32:38.3401594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3401928Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.3402215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3402911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.3403764Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.3404314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.3405011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.3405771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.3406365Z kernel = self.compile( 2025-05-07T20:32:38.3406918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.3407587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.3407983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3408220Z 2025-05-07T20:32:38.3408428Z self = 2025-05-07T20:32:38.3409514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.3410895Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785d8a0>} 2025-05-07T20:32:38.3412228Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.3413259Z context = 2025-05-07T20:32:38.3413555Z 2025-05-07T20:32:38.3413723Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.3414250Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.3414714Z module_map=module_map) 2025-05-07T20:32:38.3415084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.3415529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.3415794Z E ^ 2025-05-07T20:32:38.3416263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.3416724Z 2025-05-07T20:32:38.3417144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.3417659Z 2025-05-07T20:32:38.3417772Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.3418180Z self=, 2025-05-07T20:32:38.3418582Z T=1, 2025-05-07T20:32:38.3418772Z D=7168, 2025-05-07T20:32:38.3418963Z scale_ub=1200.0, 2025-05-07T20:32:38.3419186Z contiguous=True, 2025-05-07T20:32:38.3419411Z compiled=True, 2025-05-07T20:32:38.3419614Z ) 2025-05-07T20:32:38.3419989Z self = 2025-05-07T20:32:38.3420487Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:38.3420797Z 2025-05-07T20:32:38.3420884Z @given( 2025-05-07T20:32:38.3421108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.3421428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.3421740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.3422065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.3422401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.3422691Z ) 2025-05-07T20:32:38.3423034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.3423479Z def test_silu_mul_quant( 2025-05-07T20:32:38.3423725Z self, 2025-05-07T20:32:38.3423918Z T: int, 2025-05-07T20:32:38.3424111Z D: int, 2025-05-07T20:32:38.3424329Z scale_ub: Optional[float], 2025-05-07T20:32:38.3424607Z contiguous: bool, 2025-05-07T20:32:38.3424843Z compiled: bool, 2025-05-07T20:32:38.3425121Z ) -> None: 2025-05-07T20:32:38.3425340Z torch.manual_seed(2025) 2025-05-07T20:32:38.3425577Z 2025-05-07T20:32:38.3425852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.3426200Z 2025-05-07T20:32:38.3426391Z x_sign = torch.sign(x) 2025-05-07T20:32:38.3426688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.3427000Z x = x_sign * x_clamp 2025-05-07T20:32:38.3427236Z x0 = x[:, :D] 2025-05-07T20:32:38.3427456Z x1 = x[:, D:] 2025-05-07T20:32:38.3427668Z 2025-05-07T20:32:38.3427854Z if contiguous: 2025-05-07T20:32:38.3428089Z x0 = x0.contiguous() 2025-05-07T20:32:38.3428355Z x1 = x1.contiguous() 2025-05-07T20:32:38.3428595Z 2025-05-07T20:32:38.3428793Z if scale_ub is not None: 2025-05-07T20:32:38.3429075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.3429414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.3429723Z ) 2025-05-07T20:32:38.3429916Z else: 2025-05-07T20:32:38.3430129Z scale_ub_tensor = None 2025-05-07T20:32:38.3430380Z 2025-05-07T20:32:38.3430615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.3430929Z op = silu_mul_quant 2025-05-07T20:32:38.3431173Z if compiled: 2025-05-07T20:32:38.3431420Z op = torch.compile(op) 2025-05-07T20:32:38.3431716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3431986Z 2025-05-07T20:32:38.3432180Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.3432341Z 2025-05-07T20:32:38.3432448Z moe/activation_test.py:117: 2025-05-07T20:32:38.3432736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3433067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.3433351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.3433962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.3434528Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.3435191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.3435891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.3436428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.3437184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.3437854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.3438661Z kernel = self.compile( 2025-05-07T20:32:38.3439282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.3439962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.3440422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.3440650Z 2025-05-07T20:32:38.3440864Z self = 2025-05-07T20:32:38.3441935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.3443298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785ee80>} 2025-05-07T20:32:38.3444750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.3445780Z context = 2025-05-07T20:32:38.3446137Z 2025-05-07T20:32:38.3446312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.3446829Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.3447299Z module_map=module_map) 2025-05-07T20:32:38.3447665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.3448015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.3448276Z E ^ 2025-05-07T20:32:38.3448741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.3449617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.3450242Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> fails with the identical CompilationError in _fbgemm_silu_mul_quant
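Every failure in this run has the same root cause: the FBGEMM GenAI Triton kernels cast their outputs to fp8e4nv (the NVIDIA-native FP8 E4M3 encoding, torch.float8_e4m3fn on the PyTorch side), and Triton only lowers that dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper). The error message and the g5 instance family point to an A10G-class GPU, which reports sm_86, so only the fp8e4b15 and fp8e5 encodings are available and compilation aborts in make_ir before any kernel launches. Below is a minimal sketch of a capability guard a test like this could use to skip cleanly on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not something that exists in the FBGEMM test file:

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if this GPU can compile Triton kernels that use fp8e4nv.

    Triton lowers fp8e4nv (torch.float8_e4m3fn) to native PTX conversion
    instructions that only exist on sm_89+ (Ada/Hopper); an A10G reports
    (8, 6) and triggers the ValueError seen in the traces above.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing Hypothesis test:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...
```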
2025-05-07T20:32:38.4514678Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.4515087Z     self=<...>,
2025-05-07T20:32:38.4515489Z     T=1,
2025-05-07T20:32:38.4515675Z     D=7168,
2025-05-07T20:32:38.4515866Z     scale_ub=None,
2025-05-07T20:32:38.4516082Z     contiguous=False,
2025-05-07T20:32:38.4516309Z     compiled=True,
2025-05-07T20:32:38.4516511Z )
2025-05-07T20:32:38.5190308Z self = <...>
2025-05-07T20:32:38.5191374Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
(test body as in the first trace above; here fn() returns and the failure moves to the eager reference path)
2025-05-07T20:32:38.5203166Z         y_fp8, y_scale = fn()
2025-05-07T20:32:38.5203445Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:38.5204151Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:38.5204569Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:38.5204876Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:38.5205278Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:38.5205641Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:38.5206152Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:38.5206454Z moe/activation_test.py:126:
2025-05-07T20:32:38.5206745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5207084Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:38.5207414Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:38.5208202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:38.5208962Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:38.5209518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:38.5210261Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:38.5210950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:38.5211682Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:38.5212445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:38.5213201Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:38.5213931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:38.5214580Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:38.5215199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:38.5215734Z     fn()
2025-05-07T20:32:38.5216246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:38.5216833Z     self.fn.run(
2025-05-07T20:32:38.5217307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:38.5217838Z     kernel = self.compile(
2025-05-07T20:32:38.5218386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:38.5219048Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:38.5219450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5219887Z self = <...>
2025-05-07T20:32:38.5221015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:38.5222403Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb287af5580>}
2025-05-07T20:32:38.5223752Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:38.5224777Z context = <...>
2025-05-07T20:32:38.5225240Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:38.5225810Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:38.5226323Z                            module_map=module_map)
2025-05-07T20:32:38.5226690Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.5227054Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:38.5227325Z E   ^
2025-05-07T20:32:38.5227788Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.5228670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
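So both the fused _fbgemm_silu_mul_quant kernel and the eager reference path through triton_quantize_fp8_row (_kernel_quantize_fp8_row) share the dependency on fp8e4nv. The failure is reproducible without FBGEMM at all; the following is a minimal sketch, assuming only that Triton and a CUDA device are present, of a standalone kernel that should hit the same ValueError on a pre-sm_89 GPU. The kernel name _cast_fp8e4nv is hypothetical:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast below is what make_ir rejects on sm_86: fp8e4nv only
    # lowers on compute capability 8.9+, and 'fp8e4b15' / 'fp8e5' are
    # the only fp8 encodings Triton supports on older parts.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (sm_86) this should raise triton.compiler.errors.CompilationError
# wrapping ValueError("type fp8e4nv not supported in this architecture. ...").
_cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
```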
Hypothesis continues drawing examples; each of the following fails with the identical CompilationError in _fbgemm_silu_mul_quant:
2025-05-07T20:32:38.5229301Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.6478202Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:38.6509679Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.9012146Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:38.9944410Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:38.9976900Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:39.0008192Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:39.1421641Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1460629Z 2025-05-07T20:32:39.1461048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1461576Z 2025-05-07T20:32:39.2611458Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2612746Z self=, 2025-05-07T20:32:39.2613844Z T=4096, 2025-05-07T20:32:39.2614343Z D=5120, 2025-05-07T20:32:39.2614819Z scale_ub=1200.0, 2025-05-07T20:32:39.2615267Z contiguous=False, 2025-05-07T20:32:39.2615703Z compiled=False, 2025-05-07T20:32:39.2616107Z ) 2025-05-07T20:32:39.2616741Z self = 2025-05-07T20:32:39.2617313Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.2617607Z 2025-05-07T20:32:39.2617705Z @given( 2025-05-07T20:32:39.2617970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2618599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2618943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2619282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2619614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2619907Z ) 2025-05-07T20:32:39.2620251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2620697Z def test_silu_mul_quant( 2025-05-07T20:32:39.2620943Z self, 2025-05-07T20:32:39.2621130Z T: int, 2025-05-07T20:32:39.2621329Z D: int, 2025-05-07T20:32:39.2621552Z scale_ub: Optional[float], 2025-05-07T20:32:39.2621821Z contiguous: bool, 2025-05-07T20:32:39.2622062Z compiled: bool, 2025-05-07T20:32:39.2622294Z ) -> None: 2025-05-07T20:32:39.2622506Z torch.manual_seed(2025) 2025-05-07T20:32:39.2622746Z 2025-05-07T20:32:39.2623030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2623378Z 2025-05-07T20:32:39.2623574Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2623871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2624184Z x = x_sign * x_clamp 2025-05-07T20:32:39.2624418Z x0 = x[:, :D] 2025-05-07T20:32:39.2624635Z x1 = x[:, D:] 2025-05-07T20:32:39.2624840Z 2025-05-07T20:32:39.2625020Z if contiguous: 2025-05-07T20:32:39.2625259Z x0 = x0.contiguous() 2025-05-07T20:32:39.2625518Z x1 = x1.contiguous() 2025-05-07T20:32:39.2625757Z 2025-05-07T20:32:39.2625952Z if scale_ub is not None: 2025-05-07T20:32:39.2626226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2626560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2626871Z ) 2025-05-07T20:32:39.2627075Z else: 2025-05-07T20:32:39.2627378Z scale_ub_tensor = None 2025-05-07T20:32:39.2627638Z 2025-05-07T20:32:39.2627874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2628183Z op = silu_mul_quant 2025-05-07T20:32:39.2628439Z if compiled: 2025-05-07T20:32:39.2628694Z op = torch.compile(op) 2025-05-07T20:32:39.2629001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2629273Z 2025-05-07T20:32:39.2629472Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2629636Z 2025-05-07T20:32:39.2629748Z moe/activation_test.py:117: 2025-05-07T20:32:39.2630037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2630375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2630666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2631449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.2632243Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2632795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2633486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2634158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2634700Z kernel = self.compile( 2025-05-07T20:32:39.2635252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2635917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2636312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2636576Z 2025-05-07T20:32:39.2636811Z self = 2025-05-07T20:32:39.2637892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2639642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8bab420>} 2025-05-07T20:32:39.2640990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2642026Z context = 2025-05-07T20:32:39.2642323Z 2025-05-07T20:32:39.2642490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2643017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2643489Z module_map=module_map) 2025-05-07T20:32:39.2644016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2644370Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2644625Z E ^ 2025-05-07T20:32:39.2645092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2645552Z 2025-05-07T20:32:39.2645971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2646488Z 2025-05-07T20:32:39.2646601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2647011Z self=, 2025-05-07T20:32:39.2647420Z T=4096, 2025-05-07T20:32:39.2647611Z D=5120, 2025-05-07T20:32:39.2647798Z scale_ub=1200.0, 2025-05-07T20:32:39.2648027Z contiguous=False, 2025-05-07T20:32:39.2648257Z compiled=True, 2025-05-07T20:32:39.2648540Z ) 2025-05-07T20:32:39.2648866Z self = 2025-05-07T20:32:39.2649363Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.2649636Z 2025-05-07T20:32:39.2649722Z @given( 2025-05-07T20:32:39.2649948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2650269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2650579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2650905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2651238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2651529Z ) 2025-05-07T20:32:39.2651875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2652321Z def test_silu_mul_quant( 2025-05-07T20:32:39.2652630Z self, 2025-05-07T20:32:39.2652834Z T: int, 2025-05-07T20:32:39.2653086Z D: int, 2025-05-07T20:32:39.2653317Z scale_ub: Optional[float], 2025-05-07T20:32:39.2653592Z contiguous: bool, 2025-05-07T20:32:39.2653832Z compiled: bool, 2025-05-07T20:32:39.2654059Z ) -> None: 2025-05-07T20:32:39.2654283Z torch.manual_seed(2025) 2025-05-07T20:32:39.2654526Z 2025-05-07T20:32:39.2654806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2655153Z 2025-05-07T20:32:39.2655345Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2655640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2655949Z x = x_sign * x_clamp 2025-05-07T20:32:39.2656184Z x0 = x[:, :D] 2025-05-07T20:32:39.2656405Z x1 = x[:, D:] 2025-05-07T20:32:39.2656621Z 2025-05-07T20:32:39.2656809Z if contiguous: 2025-05-07T20:32:39.2657043Z x0 = x0.contiguous() 2025-05-07T20:32:39.2657313Z x1 = x1.contiguous() 2025-05-07T20:32:39.2657550Z 2025-05-07T20:32:39.2657818Z if scale_ub is not None: 2025-05-07T20:32:39.2658095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2658430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2658734Z ) 2025-05-07T20:32:39.2658929Z else: 2025-05-07T20:32:39.2659141Z scale_ub_tensor = None 2025-05-07T20:32:39.2659390Z 2025-05-07T20:32:39.2659644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2659957Z op = silu_mul_quant 2025-05-07T20:32:39.2660214Z if compiled: 2025-05-07T20:32:39.2660467Z op = torch.compile(op) 2025-05-07T20:32:39.2660759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2661039Z 2025-05-07T20:32:39.2661234Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2661401Z 2025-05-07T20:32:39.2661510Z moe/activation_test.py:117: 2025-05-07T20:32:39.2661805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2662145Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2662436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2662991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.2663559Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.2664228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2664917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2665454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2666145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2666875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2667407Z kernel = self.compile( 2025-05-07T20:32:39.2668007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2668671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2669070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2669299Z 2025-05-07T20:32:39.2669507Z self = 2025-05-07T20:32:39.2670590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2671997Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fc860>} 2025-05-07T20:32:39.2673352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2674428Z context = 2025-05-07T20:32:39.2674717Z 2025-05-07T20:32:39.2674885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2675409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2675883Z module_map=module_map) 2025-05-07T20:32:39.2676245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2676612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2676878Z E ^ 2025-05-07T20:32:39.2677355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2677807Z 2025-05-07T20:32:39.2678233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2678806Z 2025-05-07T20:32:39.3565469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3566772Z self=, 2025-05-07T20:32:39.3567428Z T=2048, 2025-05-07T20:32:39.3567644Z D=7168, 2025-05-07T20:32:39.3567834Z scale_ub=1200.0, 2025-05-07T20:32:39.3568067Z contiguous=False, 2025-05-07T20:32:39.3568301Z compiled=False, 2025-05-07T20:32:39.3568506Z ) 2025-05-07T20:32:39.3568833Z self = 2025-05-07T20:32:39.3569335Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.3569614Z 2025-05-07T20:32:39.3569701Z @given( 2025-05-07T20:32:39.3569931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3570257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3570585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3570910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3571242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3571528Z ) 2025-05-07T20:32:39.3571876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3572323Z def test_silu_mul_quant( 2025-05-07T20:32:39.3572565Z self, 2025-05-07T20:32:39.3572764Z T: int, 2025-05-07T20:32:39.3572958Z D: int, 2025-05-07T20:32:39.3573185Z scale_ub: Optional[float], 2025-05-07T20:32:39.3573458Z contiguous: bool, 2025-05-07T20:32:39.3573694Z compiled: bool, 2025-05-07T20:32:39.3573921Z ) -> None: 2025-05-07T20:32:39.3574141Z torch.manual_seed(2025) 2025-05-07T20:32:39.3574378Z 2025-05-07T20:32:39.3574656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3575004Z 2025-05-07T20:32:39.3575484Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3575782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3576096Z x = x_sign * x_clamp 2025-05-07T20:32:39.3576333Z x0 = x[:, :D] 2025-05-07T20:32:39.3576554Z x1 = x[:, D:] 2025-05-07T20:32:39.3576765Z 2025-05-07T20:32:39.3576949Z if contiguous: 2025-05-07T20:32:39.3577186Z x0 = x0.contiguous() 2025-05-07T20:32:39.3577488Z x1 = x1.contiguous() 2025-05-07T20:32:39.3577731Z 2025-05-07T20:32:39.3577925Z if scale_ub is not None: 2025-05-07T20:32:39.3578202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3578543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3578851Z ) 2025-05-07T20:32:39.3579052Z else: 2025-05-07T20:32:39.3579265Z scale_ub_tensor = None 2025-05-07T20:32:39.3579600Z 2025-05-07T20:32:39.3579842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3580238Z op = silu_mul_quant 2025-05-07T20:32:39.3580486Z if compiled: 2025-05-07T20:32:39.3580736Z op = torch.compile(op) 2025-05-07T20:32:39.3581036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3581310Z 2025-05-07T20:32:39.3581504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3581667Z 2025-05-07T20:32:39.3581778Z moe/activation_test.py:117: 2025-05-07T20:32:39.3582082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3582418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3582711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3583408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.3584093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3584641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3585419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3586091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3586665Z kernel = self.compile( 2025-05-07T20:32:39.3587222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3587889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3588279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3588517Z 2025-05-07T20:32:39.3588724Z self = 2025-05-07T20:32:39.3589806Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3591200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fd6c0>} 2025-05-07T20:32:39.3592549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3593574Z context = 2025-05-07T20:32:39.3593868Z 2025-05-07T20:32:39.3594036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3594561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3595037Z module_map=module_map) 2025-05-07T20:32:39.3595446Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3595813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3596074Z E ^ 2025-05-07T20:32:39.3596532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3596992Z 2025-05-07T20:32:39.3597412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3597935Z 2025-05-07T20:32:39.3598038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3598456Z self=, 2025-05-07T20:32:39.3598852Z T=1, 2025-05-07T20:32:39.3599038Z D=7168, 2025-05-07T20:32:39.3599235Z scale_ub=None, 2025-05-07T20:32:39.3599442Z contiguous=True, 2025-05-07T20:32:39.3599671Z compiled=False, 2025-05-07T20:32:39.3599878Z ) 2025-05-07T20:32:39.3600243Z self = 2025-05-07T20:32:39.3600771Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:39.3601030Z 2025-05-07T20:32:39.3601113Z @given( 2025-05-07T20:32:39.3601336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3601655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3601964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3602296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3602621Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3602910Z ) 2025-05-07T20:32:39.3603256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3603857Z def test_silu_mul_quant( 2025-05-07T20:32:39.3604100Z self, 2025-05-07T20:32:39.3604297Z T: int, 2025-05-07T20:32:39.3604495Z D: int, 2025-05-07T20:32:39.3604724Z scale_ub: Optional[float], 2025-05-07T20:32:39.3604998Z contiguous: bool, 2025-05-07T20:32:39.3605341Z compiled: bool, 2025-05-07T20:32:39.3605565Z ) -> None: 2025-05-07T20:32:39.3605783Z torch.manual_seed(2025) 2025-05-07T20:32:39.3606023Z 2025-05-07T20:32:39.3606304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3606646Z 2025-05-07T20:32:39.3606845Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3607135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3607444Z x = x_sign * x_clamp 2025-05-07T20:32:39.3607688Z x0 = x[:, :D] 2025-05-07T20:32:39.3607907Z x1 = x[:, D:] 2025-05-07T20:32:39.3608122Z 2025-05-07T20:32:39.3608310Z if contiguous: 2025-05-07T20:32:39.3608539Z x0 = x0.contiguous() 2025-05-07T20:32:39.3608801Z x1 = x1.contiguous() 2025-05-07T20:32:39.3609043Z 2025-05-07T20:32:39.3609233Z if scale_ub is not None: 2025-05-07T20:32:39.3609510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3609859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3610162Z ) 2025-05-07T20:32:39.3610356Z else: 2025-05-07T20:32:39.3610569Z scale_ub_tensor = None 2025-05-07T20:32:39.3610816Z 2025-05-07T20:32:39.3611047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3611362Z op = silu_mul_quant 2025-05-07T20:32:39.3611617Z if compiled: 2025-05-07T20:32:39.3611855Z op = torch.compile(op) 2025-05-07T20:32:39.3612156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3612432Z 2025-05-07T20:32:39.3612618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3612786Z 2025-05-07T20:32:39.3612885Z moe/activation_test.py:117: 2025-05-07T20:32:39.3613180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3613507Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3613838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3614549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3615240Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3615772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3616489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3617188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3617712Z kernel = self.compile( 2025-05-07T20:32:39.3618262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3618924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3619374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3619645Z 2025-05-07T20:32:39.3619852Z self = 2025-05-07T20:32:39.3620935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3622301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fcfe0>} 2025-05-07T20:32:39.3623652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3632999Z context = 2025-05-07T20:32:39.3633339Z 2025-05-07T20:32:39.3633533Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3634139Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3634613Z module_map=module_map) 2025-05-07T20:32:39.3634988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3635347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3635604Z E ^ 2025-05-07T20:32:39.3636083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3636532Z 2025-05-07T20:32:39.3636969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3637484Z 2025-05-07T20:32:39.3637601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3638015Z self=, 2025-05-07T20:32:39.3638625Z T=16384, 2025-05-07T20:32:39.3638837Z D=7168, 2025-05-07T20:32:39.3639028Z scale_ub=1200.0, 2025-05-07T20:32:39.3639262Z contiguous=False, 2025-05-07T20:32:39.3639497Z compiled=True, 2025-05-07T20:32:39.7234787Z ) 2025-05-07T20:32:39.7235414Z self = 2025-05-07T20:32:39.7236146Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.7236623Z 2025-05-07T20:32:39.7236860Z @given( 2025-05-07T20:32:39.7237380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7238012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7238948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7239609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7240254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7240846Z ) 2025-05-07T20:32:39.7241932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7242834Z def test_silu_mul_quant( 2025-05-07T20:32:39.7243325Z self, 2025-05-07T20:32:39.7243894Z T: int, 2025-05-07T20:32:39.7244275Z D: int, 2025-05-07T20:32:39.7244708Z scale_ub: Optional[float], 2025-05-07T20:32:39.7245252Z contiguous: bool, 2025-05-07T20:32:39.7245720Z compiled: bool, 2025-05-07T20:32:39.7246167Z ) -> None: 2025-05-07T20:32:39.7246593Z torch.manual_seed(2025) 2025-05-07T20:32:39.7247015Z 2025-05-07T20:32:39.7247345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7247695Z 2025-05-07T20:32:39.7247896Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7248190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7248505Z x = x_sign * x_clamp 2025-05-07T20:32:39.7248749Z x0 = x[:, :D] 2025-05-07T20:32:39.7249065Z x1 = x[:, D:] 2025-05-07T20:32:39.7249355Z 2025-05-07T20:32:39.7249557Z if contiguous: 2025-05-07T20:32:39.7249793Z x0 = x0.contiguous() 2025-05-07T20:32:39.7250067Z x1 = x1.contiguous() 2025-05-07T20:32:39.7250315Z 2025-05-07T20:32:39.7250509Z if scale_ub is not None: 2025-05-07T20:32:39.7250796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7251143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7251450Z ) 2025-05-07T20:32:39.7251653Z else: 2025-05-07T20:32:39.7251874Z scale_ub_tensor = None 2025-05-07T20:32:39.7252132Z 2025-05-07T20:32:39.7252378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7252700Z op = silu_mul_quant 2025-05-07T20:32:39.7252952Z if compiled: 2025-05-07T20:32:39.7253212Z op = torch.compile(op) 2025-05-07T20:32:39.7253524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7253811Z 2025-05-07T20:32:39.7254019Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7254278Z 2025-05-07T20:32:39.7254388Z moe/activation_test.py:117: 2025-05-07T20:32:39.7254680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7255019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7255306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7255865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.7256437Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.7257108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7257801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7258341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7259034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7259710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7260246Z kernel = self.compile( 2025-05-07T20:32:39.7260794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7261456Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7261859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7262091Z 2025-05-07T20:32:39.7262300Z self = 2025-05-07T20:32:39.7263383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7264817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875ffb00>} 2025-05-07T20:32:39.7266169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7267190Z context = 2025-05-07T20:32:39.7267505Z 2025-05-07T20:32:39.7267681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7268202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7268671Z module_map=module_map) 2025-05-07T20:32:39.7269038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7269433Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7269737Z E ^ 2025-05-07T20:32:39.7270209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7270661Z 2025-05-07T20:32:39.7271088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7271603Z 2025-05-07T20:32:39.7271708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7272125Z self=, 2025-05-07T20:32:39.7272533Z T=1, 2025-05-07T20:32:39.7272714Z D=7168, 2025-05-07T20:32:39.7272911Z scale_ub=None, 2025-05-07T20:32:39.7273139Z contiguous=False, 2025-05-07T20:32:39.7273360Z compiled=False, 2025-05-07T20:32:39.7273571Z ) 2025-05-07T20:32:39.7273894Z self = 2025-05-07T20:32:39.7274383Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.7274662Z 2025-05-07T20:32:39.7274791Z @given( 2025-05-07T20:32:39.7275027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7275342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7275647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7275980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7276312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7276596Z ) 2025-05-07T20:32:39.7276949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7277394Z def test_silu_mul_quant( 2025-05-07T20:32:39.7277633Z self, 2025-05-07T20:32:39.7277834Z T: int, 2025-05-07T20:32:39.7278042Z D: int, 2025-05-07T20:32:39.7278260Z scale_ub: Optional[float], 2025-05-07T20:32:39.7278534Z contiguous: bool, 2025-05-07T20:32:39.7278779Z compiled: bool, 2025-05-07T20:32:39.7279004Z ) -> None: 2025-05-07T20:32:39.7279217Z torch.manual_seed(2025) 2025-05-07T20:32:39.7279463Z 2025-05-07T20:32:39.7279736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7280069Z 2025-05-07T20:32:39.7280269Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7280565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7280867Z x = x_sign * x_clamp 2025-05-07T20:32:39.7281110Z x0 = x[:, :D] 2025-05-07T20:32:39.7281329Z x1 = x[:, D:] 2025-05-07T20:32:39.7281535Z 2025-05-07T20:32:39.7281724Z if contiguous: 2025-05-07T20:32:39.7281957Z x0 = x0.contiguous() 2025-05-07T20:32:39.7282208Z x1 = x1.contiguous() 2025-05-07T20:32:39.7282450Z 2025-05-07T20:32:39.7282645Z if scale_ub is not None: 2025-05-07T20:32:39.7282917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7283256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7283670Z ) 2025-05-07T20:32:39.7283922Z else: 2025-05-07T20:32:39.7284131Z scale_ub_tensor = None 2025-05-07T20:32:39.7284387Z 2025-05-07T20:32:39.7284622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7284932Z op = silu_mul_quant 2025-05-07T20:32:39.7285181Z if compiled: 2025-05-07T20:32:39.7285429Z op = torch.compile(op) 2025-05-07T20:32:39.7285721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7285998Z 2025-05-07T20:32:39.7286193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7286357Z 2025-05-07T20:32:39.7286460Z moe/activation_test.py:117: 2025-05-07T20:32:39.7286757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7287093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7287377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7288114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7288854Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7289399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7290078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7290752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7291292Z kernel = self.compile( 2025-05-07T20:32:39.7291844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7292500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7292900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7293129Z 2025-05-07T20:32:39.7293350Z self = 2025-05-07T20:32:39.7294473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7295834Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a83749a0>} 2025-05-07T20:32:39.7297177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7298212Z context = 2025-05-07T20:32:39.7298499Z 2025-05-07T20:32:39.7298676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7299197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7299672Z module_map=module_map) 2025-05-07T20:32:39.7300037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7300395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7300652Z E ^ 2025-05-07T20:32:39.7301116Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7301568Z 2025-05-07T20:32:39.7301992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7302506Z 2025-05-07T20:32:39.7302612Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7303025Z self=, 2025-05-07T20:32:39.7303450Z T=2048, 2025-05-07T20:32:39.7303642Z D=7168, 2025-05-07T20:32:39.7303837Z scale_ub=None, 2025-05-07T20:32:39.7304105Z contiguous=False, 2025-05-07T20:32:39.7304337Z compiled=True, 2025-05-07T20:32:39.7304554Z ) 2025-05-07T20:32:39.7995394Z self = 2025-05-07T20:32:39.7996193Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.7996594Z 2025-05-07T20:32:39.7996722Z @given( 2025-05-07T20:32:39.7996958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7997281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7997596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7997927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7998255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7998550Z ) 2025-05-07T20:32:39.7998906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7999685Z def test_silu_mul_quant( 2025-05-07T20:32:39.8000039Z self, 2025-05-07T20:32:39.8000251Z T: int, 2025-05-07T20:32:39.8000446Z D: int, 2025-05-07T20:32:39.8000669Z scale_ub: Optional[float], 2025-05-07T20:32:39.8000948Z contiguous: bool, 2025-05-07T20:32:39.8001187Z compiled: bool, 2025-05-07T20:32:39.8001424Z ) -> None: 2025-05-07T20:32:39.8001641Z torch.manual_seed(2025) 2025-05-07T20:32:39.8001881Z 2025-05-07T20:32:39.8002159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8002502Z 2025-05-07T20:32:39.8002709Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8003000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8003315Z x = x_sign * x_clamp 2025-05-07T20:32:39.8003690Z x0 = x[:, :D] 2025-05-07T20:32:39.8003910Z x1 = x[:, D:] 2025-05-07T20:32:39.8004126Z 2025-05-07T20:32:39.8004318Z if contiguous: 2025-05-07T20:32:39.8004554Z x0 = x0.contiguous() 2025-05-07T20:32:39.8004823Z x1 = x1.contiguous() 2025-05-07T20:32:39.8005160Z 2025-05-07T20:32:39.8005353Z if scale_ub is not None: 2025-05-07T20:32:39.8005634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8005982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8006286Z ) 2025-05-07T20:32:39.8006498Z else: 2025-05-07T20:32:39.8006752Z scale_ub_tensor = None 2025-05-07T20:32:39.8007008Z 2025-05-07T20:32:39.8007245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8007565Z op = silu_mul_quant 2025-05-07T20:32:39.8007830Z if compiled: 2025-05-07T20:32:39.8008078Z op = torch.compile(op) 2025-05-07T20:32:39.8008380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8008672Z 2025-05-07T20:32:39.8008864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8009039Z 2025-05-07T20:32:39.8009141Z moe/activation_test.py:117: 2025-05-07T20:32:39.8009456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8009789Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8010078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8010652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.8011224Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.8011896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.8012592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8013147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8013833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8014606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8015160Z kernel = self.compile( 2025-05-07T20:32:39.8015718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8016382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8016789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8017021Z 2025-05-07T20:32:39.8017241Z self = 2025-05-07T20:32:39.8018348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8019791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8375d00>} 2025-05-07T20:32:39.8021181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8022213Z context = 2025-05-07T20:32:39.8022500Z 2025-05-07T20:32:39.8022680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8023199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8023670Z module_map=module_map) 2025-05-07T20:32:39.8024050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8024426Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8024692Z E ^ 2025-05-07T20:32:39.8025196Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8025699Z 2025-05-07T20:32:39.8026129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8026648Z 2025-05-07T20:32:39.8026755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8027178Z self=, 2025-05-07T20:32:39.8027584Z T=4096, 2025-05-07T20:32:39.8027778Z D=7168, 2025-05-07T20:32:39.8027969Z scale_ub=None, 2025-05-07T20:32:39.8028190Z contiguous=False, 2025-05-07T20:32:39.8028420Z compiled=True, 2025-05-07T20:32:39.8028624Z ) 2025-05-07T20:32:39.8028954Z self = 2025-05-07T20:32:39.8029455Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.8029728Z 2025-05-07T20:32:39.8029810Z @given( 2025-05-07T20:32:39.8030050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8030374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8030678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8031022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8031358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8031652Z ) 2025-05-07T20:32:39.8031999Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8032452Z def test_silu_mul_quant( 2025-05-07T20:32:39.8032704Z self, 2025-05-07T20:32:39.8032896Z T: int, 2025-05-07T20:32:39.8033098Z D: int, 2025-05-07T20:32:39.8033324Z scale_ub: Optional[float], 2025-05-07T20:32:39.8033593Z contiguous: bool, 2025-05-07T20:32:39.8033846Z compiled: bool, 2025-05-07T20:32:39.8034076Z ) -> None: 2025-05-07T20:32:39.8034288Z torch.manual_seed(2025) 2025-05-07T20:32:39.8034545Z 2025-05-07T20:32:39.8034877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8035229Z 2025-05-07T20:32:39.8035428Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8035730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8036047Z x = x_sign * x_clamp 2025-05-07T20:32:39.8036285Z x0 = x[:, :D] 2025-05-07T20:32:39.8036505Z x1 = x[:, D:] 2025-05-07T20:32:39.8036740Z 2025-05-07T20:32:39.8036947Z if contiguous: 2025-05-07T20:32:39.8037182Z x0 = x0.contiguous() 2025-05-07T20:32:39.8037447Z x1 = x1.contiguous() 2025-05-07T20:32:39.8037686Z 2025-05-07T20:32:39.8037881Z if scale_ub is not None: 2025-05-07T20:32:39.8038159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8038864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8039184Z ) 2025-05-07T20:32:39.8039383Z else: 2025-05-07T20:32:39.8039666Z scale_ub_tensor = None 2025-05-07T20:32:39.8039981Z 2025-05-07T20:32:39.8040225Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8040540Z op = silu_mul_quant 2025-05-07T20:32:39.8040798Z if compiled: 2025-05-07T20:32:39.8041050Z op = torch.compile(op) 2025-05-07T20:32:39.8041344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8041623Z 2025-05-07T20:32:39.8041818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8041983Z 2025-05-07T20:32:39.8042094Z moe/activation_test.py:117: 2025-05-07T20:32:39.8042387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8042722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8043014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8043708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.8044285Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.8044958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.8045741Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8046282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8046978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8047656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8048198Z kernel = self.compile( 2025-05-07T20:32:39.8048753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8049423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8049829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8050064Z 2025-05-07T20:32:39.8050286Z self = 2025-05-07T20:32:39.8051377Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8052750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8376840>} 2025-05-07T20:32:39.8054102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8055138Z context = 2025-05-07T20:32:39.8055426Z 2025-05-07T20:32:39.8055666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8056204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8056678Z module_map=module_map) 2025-05-07T20:32:39.8057043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8057455Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8057724Z E ^ 2025-05-07T20:32:39.8058193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8058644Z 2025-05-07T20:32:39.8059065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8059587Z 2025-05-07T20:32:39.9335565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9336255Z self=, 2025-05-07T20:32:39.9337128Z T=16384, 2025-05-07T20:32:39.9337488Z D=5120, 2025-05-07T20:32:39.9337764Z scale_ub=1200.0, 2025-05-07T20:32:39.9338050Z contiguous=False, 2025-05-07T20:32:39.9338339Z compiled=False, 2025-05-07T20:32:39.9338830Z ) 2025-05-07T20:32:39.9339158Z self = 2025-05-07T20:32:39.9339662Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.9339942Z 2025-05-07T20:32:39.9340020Z @given( 2025-05-07T20:32:39.9340249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9340562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9340864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9341196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9341521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9341806Z ) 2025-05-07T20:32:39.9342150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9342598Z def test_silu_mul_quant( 2025-05-07T20:32:39.9342968Z self, 2025-05-07T20:32:39.9343157Z T: int, 2025-05-07T20:32:39.9343357Z D: int, 2025-05-07T20:32:39.9343575Z scale_ub: Optional[float], 2025-05-07T20:32:39.9343842Z contiguous: bool, 2025-05-07T20:32:39.9344106Z compiled: bool, 2025-05-07T20:32:39.9344341Z ) -> None: 2025-05-07T20:32:39.9344555Z torch.manual_seed(2025) 2025-05-07T20:32:39.9353807Z 2025-05-07T20:32:39.9354091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9354429Z 2025-05-07T20:32:39.9354640Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9354936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9355254Z x = x_sign * x_clamp 2025-05-07T20:32:39.9355502Z x0 = x[:, :D] 2025-05-07T20:32:39.9355719Z x1 = x[:, D:] 2025-05-07T20:32:39.9355941Z 2025-05-07T20:32:39.9356139Z if contiguous: 2025-05-07T20:32:39.9356384Z x0 = x0.contiguous() 2025-05-07T20:32:39.9356655Z x1 = x1.contiguous() 2025-05-07T20:32:39.9356904Z 2025-05-07T20:32:39.9357097Z if scale_ub is not None: 2025-05-07T20:32:39.9357383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9357727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9358032Z ) 2025-05-07T20:32:39.9358239Z else: 2025-05-07T20:32:39.9358455Z scale_ub_tensor = None 2025-05-07T20:32:39.9358699Z 2025-05-07T20:32:39.9358941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9359263Z op = silu_mul_quant 2025-05-07T20:32:39.9359519Z if compiled: 2025-05-07T20:32:39.9359769Z op = torch.compile(op) 2025-05-07T20:32:39.9360069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9360352Z 2025-05-07T20:32:39.9360547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9360724Z 2025-05-07T20:32:39.9360958Z moe/activation_test.py:117: 2025-05-07T20:32:39.9361261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9361597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9361881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9362579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.9363274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9363975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9364665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9365338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9365945Z kernel = self.compile( 2025-05-07T20:32:39.9366500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9367272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9367685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9367914Z 2025-05-07T20:32:39.9368123Z self = 2025-05-07T20:32:39.9369203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9370580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f14040>} 2025-05-07T20:32:39.9371927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9372998Z context = 2025-05-07T20:32:39.9373286Z 2025-05-07T20:32:39.9373455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9373982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9374458Z module_map=module_map) 2025-05-07T20:32:39.9374822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9375182Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9375450Z E ^ 2025-05-07T20:32:39.9375924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9376375Z 2025-05-07T20:32:39.9376799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9377325Z 2025-05-07T20:32:39.9377428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9377871Z self=, 2025-05-07T20:32:39.9378281Z T=16384, 2025-05-07T20:32:39.9378474Z D=5120, 2025-05-07T20:32:39.9378671Z scale_ub=1200.0, 2025-05-07T20:32:39.9378899Z contiguous=True, 2025-05-07T20:32:39.9379117Z compiled=True, 2025-05-07T20:32:39.9379327Z ) 2025-05-07T20:32:39.9379657Z self = 2025-05-07T20:32:39.9380164Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.9380439Z 2025-05-07T20:32:39.9380516Z @given( 2025-05-07T20:32:39.9380752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9381069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9381422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9381760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9382087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9382370Z ) 2025-05-07T20:32:39.9382722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9383173Z def test_silu_mul_quant( 2025-05-07T20:32:39.9383420Z self, 2025-05-07T20:32:39.9383613Z T: int, 2025-05-07T20:32:39.9383819Z D: int, 2025-05-07T20:32:39.9384046Z scale_ub: Optional[float], 2025-05-07T20:32:39.9384311Z contiguous: bool, 2025-05-07T20:32:39.9384553Z compiled: bool, 2025-05-07T20:32:39.9384772Z ) -> None: 2025-05-07T20:32:39.9384986Z torch.manual_seed(2025) 2025-05-07T20:32:39.9385231Z 2025-05-07T20:32:39.9385498Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9385883Z 2025-05-07T20:32:39.9386078Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9386408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9386718Z x = x_sign * x_clamp 2025-05-07T20:32:39.9386962Z x0 = x[:, :D] 2025-05-07T20:32:39.9387176Z x1 = x[:, D:] 2025-05-07T20:32:39.9387387Z 2025-05-07T20:32:39.9387578Z if contiguous: 2025-05-07T20:32:39.9387811Z x0 = x0.contiguous() 2025-05-07T20:32:39.9388063Z x1 = x1.contiguous() 2025-05-07T20:32:39.9388302Z 2025-05-07T20:32:39.9388494Z if scale_ub is not None: 2025-05-07T20:32:39.9388761Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9389094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9389402Z ) 2025-05-07T20:32:39.9389588Z else: 2025-05-07T20:32:39.9389797Z scale_ub_tensor = None 2025-05-07T20:32:39.9390049Z 2025-05-07T20:32:39.9390277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9390597Z op = silu_mul_quant 2025-05-07T20:32:39.9390915Z if compiled: 2025-05-07T20:32:39.9391157Z op = torch.compile(op) 2025-05-07T20:32:39.9391455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9391733Z 2025-05-07T20:32:39.9391919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9392089Z 2025-05-07T20:32:39.9392187Z moe/activation_test.py:117: 2025-05-07T20:32:39.9392482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9392817Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9393093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9393656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9394220Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9394876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9395576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9396118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9396857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9397520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9398054Z kernel = self.compile( 2025-05-07T20:32:39.9398600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9399262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9399652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9399886Z 2025-05-07T20:32:39.9400096Z self = 2025-05-07T20:32:39.9401217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9402588Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f15300>} 2025-05-07T20:32:39.9404010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9405038Z context = 2025-05-07T20:32:39.9405329Z 2025-05-07T20:32:39.9405495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9406068Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9406569Z module_map=module_map) 2025-05-07T20:32:39.9406937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9407291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9407546Z E ^ 2025-05-07T20:32:39.9408013Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9408470Z 2025-05-07T20:32:39.9408887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9409401Z 2025-05-07T20:32:40.2640687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.2641953Z self=, 2025-05-07T20:32:40.2643065Z T=16384, 2025-05-07T20:32:40.2643469Z D=5120, 2025-05-07T20:32:40.2644046Z scale_ub=None, 2025-05-07T20:32:40.2644509Z contiguous=False, 2025-05-07T20:32:40.2644980Z compiled=True, 2025-05-07T20:32:40.2645774Z ) 2025-05-07T20:32:40.2646418Z self = 2025-05-07T20:32:40.2647088Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.2647370Z 2025-05-07T20:32:40.2647456Z @given( 2025-05-07T20:32:40.2647686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.2648001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.2648310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.2648636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.2648966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.2649254Z ) 2025-05-07T20:32:40.2649601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.2650053Z def test_silu_mul_quant( 2025-05-07T20:32:40.2650308Z self, 2025-05-07T20:32:40.2650502Z T: int, 2025-05-07T20:32:40.2650712Z D: int, 2025-05-07T20:32:40.2650935Z scale_ub: Optional[float], 2025-05-07T20:32:40.2651210Z contiguous: bool, 2025-05-07T20:32:40.2651447Z compiled: bool, 2025-05-07T20:32:40.2651681Z ) -> None: 2025-05-07T20:32:40.2651908Z torch.manual_seed(2025) 2025-05-07T20:32:40.2652145Z 2025-05-07T20:32:40.2652420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.2652764Z 2025-05-07T20:32:40.2652956Z x_sign = torch.sign(x) 2025-05-07T20:32:40.2653252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.2653561Z x = x_sign * x_clamp 2025-05-07T20:32:40.2653796Z x0 = x[:, :D] 2025-05-07T20:32:40.2654014Z x1 = x[:, D:] 2025-05-07T20:32:40.2654230Z 2025-05-07T20:32:40.2654416Z if contiguous: 2025-05-07T20:32:40.2654651Z x0 = x0.contiguous() 2025-05-07T20:32:40.2654920Z x1 = x1.contiguous() 2025-05-07T20:32:40.2655249Z 2025-05-07T20:32:40.2655455Z if scale_ub is not None: 2025-05-07T20:32:40.2655754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.2656094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.2656397Z ) 2025-05-07T20:32:40.2656598Z else: 2025-05-07T20:32:40.2656811Z scale_ub_tensor = None 2025-05-07T20:32:40.2657059Z 2025-05-07T20:32:40.2657301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2657620Z op = silu_mul_quant 2025-05-07T20:32:40.2657868Z if compiled: 2025-05-07T20:32:40.2658119Z op = torch.compile(op) 2025-05-07T20:32:40.2658418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2658689Z 2025-05-07T20:32:40.2658888Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.2659053Z 2025-05-07T20:32:40.2659160Z moe/activation_test.py:117: 2025-05-07T20:32:40.2659537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2659982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.2660267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2660832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.2661391Z return fn(*args, **kwargs) 
[hypothesis went on to try ten more examples; each one failed inside _fbgemm_silu_mul_quant with the identical CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the repeated test source and tracebacks are elided]
2025-05-07T20:32:40.2676650Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:40.3441823Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:40.4821283Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:40.4875272Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:40.7631916Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:40.8628812Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:40.8681359Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.0080475Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.0134524Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.0909919Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
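Every CompilationError above is the same architecture mismatch: fp8e4nv is Triton's name for FP8 E4M3, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward, while the A10G in this linux.g5.4xlarge.nvidia.gpu runner reports capability 8.6 and therefore only offers 'fp8e4b15' and 'fp8e5'. A guard along the following lines, shown here as a hypothetical addition (activation_test.py contains no such guard), would skip the test on pre-sm_89 runners instead of failing it:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9); the A10G on
        # a g5.4xlarge reports (8, 6), so this returns False there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...

The other obvious fix is to route this job to an sm_89-or-newer runner pool.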
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.0955073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.1605308Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... test source identical to the first listing above ...]
2025-05-07T20:32:41.1620239Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1623948Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1627626Z moe/activation_test.py:95: OutOfMemoryError
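The free-memory figure in these OOM reports shrinks as the run proceeds (140.44 MiB here, then 28.44 MiB and 26.44 MiB below), which suggests allocations from failed examples survive into later ones. A minimal cleanup sketch between Hypothesis examples, assuming that accumulation is the cause; release_cuda_memory is a hypothetical helper, not part of the test file:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to CUDA tensors
        torch.cuda.synchronize()  # let in-flight kernels finish first
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

Calling it at the top of the test (or from a per-example hook) would keep one example's tensors from starving the next.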
2025-05-07T20:32:41.1628162Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.1655119Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1658838Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1662717Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:41.1663267Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.1676816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.1680691Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1684582Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.1685135Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.1699068Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.1702289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.1705700Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:41.1706231Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.2591006Z > x_sign = torch.sign(x)
2025-05-07T20:32:41.2592982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.2595092Z moe/activation_test.py:94: OutOfMemoryError
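Every report above ends with the same allocator hint. A sketch of applying it, assuming the variable can be set before CUDA initializes in the test process; in a job like this one it would more naturally be exported in the workflow environment:

    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the caching allocator sees it

Expandable segments reduce fragmentation of the reserved-but-unallocated pool; they do not add capacity, so the accumulation noted above would still need fixing.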
2025-05-07T20:32:41.2595418Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.2610185Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.2610458Z moe/activation_test.py:117:
[... same Triton traceback through silu_mul_quant -> jit.py -> compiler.py as in the first failure above ...]
2025-05-07T20:32:41.2624193Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.2624558Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.2624831Z E ^
2025-05-07T20:32:41.2625299Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.2626186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
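Note this example has compiled=False: the CompilationError comes from Triton's own JIT when _fbgemm_silu_mul_quant is built, so every path that reaches the kernel fails on this GPU. The A10G on a linux.g5.4xlarge runner is compute capability 8.6, and Triton's fp8e4nv (e4m3) dtype generally requires 8.9 or newer, which matches the error's list of remaining fp8 dtypes. A minimal skip-guard sketch under that assumption; supports_fp8e4nv is a hypothetical helper, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # sm_89 (Ada) and newer expose fp8e4nv; the A10G here reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Applied as, e.g., @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+") on the test, it would turn these hard failures into skips on pre-Ada runners.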
2025-05-07T20:32:41.2626818Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5085111Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5085483Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5099189Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5099551Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5099816Z E ^
2025-05-07T20:32:41.5100290Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5101174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.5101810Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5116344Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5116607Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5130249Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5130611Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5130873Z E ^
2025-05-07T20:32:41.5131353Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5144598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
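For debugging outside Hypothesis, a standalone repro sketch of the compile failure, assuming the import path shown in the traceback and the call shape used by the test; the smallest failing example (T=1, D=7168) keeps memory pressure out of the picture:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # Building _fbgemm_silu_mul_quant raises the fp8e4nv CompilationError on
    # sm_86, with or without torch.compile around the op.
    y_fp8, y_scale = silu_mul_quant(x[:, :D].contiguous(), x[:, D:].contiguous(), None)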
2025-05-07T20:32:41.5145237Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5827152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.5829248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.5831257Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.5831695Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5846815Z > y_fp8, y_scale = fn()
2025-05-07T20:32:41.5847082Z moe/activation_test.py:117:
[... same Triton traceback as above ...]
2025-05-07T20:32:41.5860776Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.5861138Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.5861396Z E ^
2025-05-07T20:32:41.5861870Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.5862800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
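All of this per-example output comes from @settings(verbosity=Verbosity.verbose, ...) in the test. A hedged sketch of a quieter profile for CI runs; the "ci" profile name is illustrative, and max_examples is left to the existing _MAX_SAMPLES:

    from hypothesis import Verbosity, settings

    settings.register_profile("ci", verbosity=Verbosity.normal, deadline=None)
    settings.load_profile("ci")  # e.g. from conftest.py when running under CI

With normal verbosity Hypothesis still reports the final falsifying example, so nothing diagnostic is lost.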
2025-05-07T20:32:41.5863427Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.5872053Z > x_sign = torch.sign(x)
2025-05-07T20:32:41.5874005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.5875982Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:41.5876346Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6584606Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6586647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6588682Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6589015Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6597173Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6599210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6601179Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6601550Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6609636Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6611721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6613787Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6614112Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.6622146Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6624191Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6626171Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6626491Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.6634626Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.6636722Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.6638935Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.6639254Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7574518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7576843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7578919Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7579246Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test source as above ...]
2025-05-07T20:32:41.7587304Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7589351Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7591421Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7591741Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7609069Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7611183Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7613184Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7613518Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7621571Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7623635Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [... allocator details identical to the report above ...]
2025-05-07T20:32:41.7625622Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:41.7625949Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... test source as above ...]
2025-05-07T20:32:41.7634076Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.7636138Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.7638109Z 2025-05-07T20:32:41.7638226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.7638723Z 2025-05-07T20:32:41.7638829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.7639244Z self=, 2025-05-07T20:32:41.7639641Z T=128, 2025-05-07T20:32:41.7639831Z D=5120, 2025-05-07T20:32:41.7640024Z scale_ub=1200.0, 2025-05-07T20:32:41.7640245Z contiguous=False, 2025-05-07T20:32:41.7640475Z compiled=False, 2025-05-07T20:32:41.7640678Z ) 2025-05-07T20:32:41.8660362Z self = 2025-05-07T20:32:41.8661085Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.8661398Z 2025-05-07T20:32:41.8661486Z @given( 2025-05-07T20:32:41.8661727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8662046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8662354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8662678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8663006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8663296Z ) 2025-05-07T20:32:41.8663643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8664091Z def test_silu_mul_quant( 2025-05-07T20:32:41.8664341Z self, 2025-05-07T20:32:41.8664533Z T: int, 2025-05-07T20:32:41.8664734Z D: int, 2025-05-07T20:32:41.8664958Z scale_ub: Optional[float], 2025-05-07T20:32:41.8665230Z contiguous: bool, 2025-05-07T20:32:41.8665476Z compiled: bool, 2025-05-07T20:32:41.8665710Z ) -> None: 2025-05-07T20:32:41.8666254Z torch.manual_seed(2025) 2025-05-07T20:32:41.8666508Z 2025-05-07T20:32:41.8666872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8667212Z 2025-05-07T20:32:41.8667414Z x_sign = torch.sign(x) 2025-05-07T20:32:41.8667710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.8668026Z x = x_sign * x_clamp 2025-05-07T20:32:41.8668262Z x0 = x[:, :D] 2025-05-07T20:32:41.8668485Z x1 = x[:, D:] 2025-05-07T20:32:41.8668697Z 2025-05-07T20:32:41.8668884Z if contiguous: 2025-05-07T20:32:41.8669122Z x0 = x0.contiguous() 2025-05-07T20:32:41.8669383Z x1 = x1.contiguous() 2025-05-07T20:32:41.8669619Z 2025-05-07T20:32:41.8669815Z if scale_ub is not None: 2025-05-07T20:32:41.8670088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.8670503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.8670819Z ) 2025-05-07T20:32:41.8671020Z else: 2025-05-07T20:32:41.8671229Z scale_ub_tensor = None 2025-05-07T20:32:41.8671487Z 2025-05-07T20:32:41.8671724Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.8672033Z op = silu_mul_quant 2025-05-07T20:32:41.8672287Z if compiled: 2025-05-07T20:32:41.8672540Z op = torch.compile(op) 2025-05-07T20:32:41.8672836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.8673104Z 2025-05-07T20:32:41.8673304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.8673468Z 2025-05-07T20:32:41.8673580Z moe/activation_test.py:117: 2025-05-07T20:32:41.8673870Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.8674207Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.8674496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.8675189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.8675977Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.8676526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.8677224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.8677939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.8678475Z kernel = self.compile( 2025-05-07T20:32:41.8679023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.8679689Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.8680083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.8680323Z 2025-05-07T20:32:41.8680535Z self = 2025-05-07T20:32:41.8681625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.8683024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb286dd2700>} 2025-05-07T20:32:41.8684475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.8685513Z context = 2025-05-07T20:32:41.8685809Z 2025-05-07T20:32:41.8685980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.8686562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.8687073Z module_map=module_map) 2025-05-07T20:32:41.8687443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.8687802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.8688063Z E ^ 2025-05-07T20:32:41.8688533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.8688994Z 2025-05-07T20:32:41.8689419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.8689936Z 2025-05-07T20:32:41.8690046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8690458Z self=, 2025-05-07T20:32:41.8690864Z T=2048, 2025-05-07T20:32:41.8691056Z D=7168, 2025-05-07T20:32:41.8691320Z scale_ub=None, 2025-05-07T20:32:41.8691541Z contiguous=False, 2025-05-07T20:32:41.8691771Z compiled=False, 2025-05-07T20:32:41.8691976Z ) 2025-05-07T20:32:41.8692295Z self = 2025-05-07T20:32:41.8692812Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.8693084Z 2025-05-07T20:32:41.8693163Z @given( 2025-05-07T20:32:41.8693386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8693722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8694033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8694354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8694688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8694980Z ) 2025-05-07T20:32:41.8695329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8695783Z def test_silu_mul_quant( 2025-05-07T20:32:41.8696023Z self, 2025-05-07T20:32:41.8696219Z T: int, 2025-05-07T20:32:41.8696469Z D: int, 2025-05-07T20:32:41.8696692Z scale_ub: Optional[float], 2025-05-07T20:32:41.8696971Z contiguous: bool, 2025-05-07T20:32:41.8697205Z compiled: bool, 2025-05-07T20:32:41.8697432Z ) -> None: 2025-05-07T20:32:41.8697651Z torch.manual_seed(2025) 2025-05-07T20:32:41.8697889Z 2025-05-07T20:32:41.8698165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8700231Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
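Every OOM message here ends with the allocator's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Given that roughly 21.7 GiB of the 22.07 GiB card is already allocated by PyTorch and only a few MiB are reserved-but-unallocated, fragmentation is unlikely to be the whole story, but the setting is cheap to try. A minimal sketch; the variable must be set before the first CUDA allocation:

    # Sketch: opt in to expandable segments, as suggested by the error text.
    # Set the variable before importing torch so the CUDA caching allocator
    # reads it when it initializes.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    _ = torch.zeros(1, device="cuda")  # first allocation picks up the setting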
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8702088Z 2025-05-07T20:32:41.8702214Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.8702425Z 2025-05-07T20:32:41.8702528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8702938Z self=, 2025-05-07T20:32:41.8703342Z T=128, 2025-05-07T20:32:41.8703531Z D=7168, 2025-05-07T20:32:41.8703721Z scale_ub=1200.0, 2025-05-07T20:32:41.8703945Z contiguous=True, 2025-05-07T20:32:41.8704169Z compiled=True, 2025-05-07T20:32:41.8704368Z ) 2025-05-07T20:32:41.9013601Z self = 2025-05-07T20:32:41.9014114Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.9014503Z 2025-05-07T20:32:41.9014620Z @given( 2025-05-07T20:32:41.9014949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9015475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9015865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9016211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9016555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9016849Z ) 2025-05-07T20:32:41.9017213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9017673Z def test_silu_mul_quant( 2025-05-07T20:32:41.9017923Z self, 2025-05-07T20:32:41.9018136Z T: int, 2025-05-07T20:32:41.9018353Z D: int, 2025-05-07T20:32:41.9018583Z scale_ub: Optional[float], 2025-05-07T20:32:41.9018874Z contiguous: bool, 2025-05-07T20:32:41.9019133Z compiled: bool, 2025-05-07T20:32:41.9019367Z ) -> None: 2025-05-07T20:32:41.9019602Z torch.manual_seed(2025) 2025-05-07T20:32:41.9019861Z 2025-05-07T20:32:41.9020441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9020810Z 2025-05-07T20:32:41.9021024Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9021335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9021650Z x = x_sign * x_clamp 2025-05-07T20:32:41.9021909Z x0 = x[:, :D] 2025-05-07T20:32:41.9022144Z x1 = x[:, D:] 2025-05-07T20:32:41.9022361Z 2025-05-07T20:32:41.9022564Z if contiguous: 2025-05-07T20:32:41.9022815Z x0 = x0.contiguous() 2025-05-07T20:32:41.9023081Z x1 = x1.contiguous() 2025-05-07T20:32:41.9023342Z 2025-05-07T20:32:41.9023549Z if scale_ub is not None: 2025-05-07T20:32:41.9023830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9024185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9024509Z ) 2025-05-07T20:32:41.9024711Z else: 2025-05-07T20:32:41.9024942Z scale_ub_tensor = None 2025-05-07T20:32:41.9025210Z 2025-05-07T20:32:41.9025457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9025881Z op = silu_mul_quant 2025-05-07T20:32:41.9026146Z if compiled: 2025-05-07T20:32:41.9026408Z op = torch.compile(op) 2025-05-07T20:32:41.9026714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9027005Z 2025-05-07T20:32:41.9027216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9027411Z 2025-05-07T20:32:41.9027541Z moe/activation_test.py:117: 2025-05-07T20:32:41.9027851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9028196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9028486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9029062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9029642Z return fn(*args, **kwargs) 
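The only difference between this compiled=True trace and the eager ones is the torch/_dynamo/eval_frame.py frame above; the frames that follow are identical, because both paths end up JIT-compiling the same _fbgemm_silu_mul_quant Triton kernel. A self-contained sketch of that dispatch (assumes a CUDA device and an fbgemm_gpu gen_ai build, as on this runner; the import path is inferred from the site-packages paths in the trace):

    # Sketch: the test's dispatch pattern. torch.compile wraps the op, but the
    # same Triton kernel is compiled either way, so both paths fail alike on a
    # GPU without fp8e4nv support.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)

    for compiled in (False, True):
        op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
        y_fp8, y_scale = op(x0, x1, None)  # scale_ub_tensor=None, as in the test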
2025-05-07T20:32:41.9030327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9031031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9031582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9032283Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9032957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9033510Z kernel = self.compile( 2025-05-07T20:32:41.9034067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9034745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9035153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9035396Z 2025-05-07T20:32:41.9035662Z self = 2025-05-07T20:32:41.9036801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9038195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb286dd3f60>} 2025-05-07T20:32:41.9039790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9040832Z context = 2025-05-07T20:32:41.9041132Z 2025-05-07T20:32:41.9041377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9041922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9042404Z module_map=module_map) 2025-05-07T20:32:41.9042789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9043167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9043441Z E ^ 2025-05-07T20:32:41.9044000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9044464Z 2025-05-07T20:32:41.9044891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9045415Z 2025-05-07T20:32:41.9045530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9045962Z self=, 2025-05-07T20:32:41.9046380Z T=128, 2025-05-07T20:32:41.9046585Z D=7168, 2025-05-07T20:32:41.9046800Z scale_ub=1200.0, 2025-05-07T20:32:41.9047102Z contiguous=True, 2025-05-07T20:32:41.9047341Z compiled=False, 2025-05-07T20:32:41.9047562Z ) 2025-05-07T20:32:41.9047888Z self = 2025-05-07T20:32:41.9048399Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9048674Z 2025-05-07T20:32:41.9048765Z @given( 2025-05-07T20:32:41.9048999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9049323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9049640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9049977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9050308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9050611Z ) 2025-05-07T20:32:41.9050975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9051421Z def test_silu_mul_quant( 2025-05-07T20:32:41.9051677Z self, 2025-05-07T20:32:41.9051891Z T: int, 2025-05-07T20:32:41.9052095Z D: int, 2025-05-07T20:32:41.9052329Z scale_ub: Optional[float], 2025-05-07T20:32:41.9052614Z contiguous: bool, 2025-05-07T20:32:41.9052862Z compiled: bool, 2025-05-07T20:32:41.9053099Z ) -> None: 2025-05-07T20:32:41.9053328Z torch.manual_seed(2025) 2025-05-07T20:32:41.9053575Z 2025-05-07T20:32:41.9053862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9054218Z 2025-05-07T20:32:41.9054422Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9054728Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9056820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
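Note how the OOM has drifted: earlier examples failed at activation_test.py:92 (the initial randn) with ~26 MiB free, while this one survives the allocation and dies at line 95 (the clamp) with only 4.44 MiB free, which suggests tensors from previous Hypothesis examples are still resident on the device. A hedged mitigation sketch using standard torch/gc APIs (not the test's actual code):

    # Sketch: release CUDA memory between Hypothesis examples, e.g. from the
    # test class's tearDown() or at the end of the test body.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # make sure pending kernels have finished
        torch.cuda.empty_cache()  # return cached blocks to the driver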
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9058745Z 2025-05-07T20:32:41.9058871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:41.9059089Z 2025-05-07T20:32:41.9059206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9059625Z self=, 2025-05-07T20:32:41.9060044Z T=128, 2025-05-07T20:32:41.9060247Z D=5120, 2025-05-07T20:32:41.9060447Z scale_ub=1200.0, 2025-05-07T20:32:41.9060687Z contiguous=True, 2025-05-07T20:32:41.9060925Z compiled=True, 2025-05-07T20:32:41.9061139Z ) 2025-05-07T20:32:41.9061527Z self = 2025-05-07T20:32:41.9062036Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.9062312Z 2025-05-07T20:32:41.9062403Z @given( 2025-05-07T20:32:41.9062639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9062965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9063284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9063617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9063959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9064259Z ) 2025-05-07T20:32:41.9064614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9065263Z def test_silu_mul_quant( 2025-05-07T20:32:41.9065524Z self, 2025-05-07T20:32:41.9065732Z T: int, 2025-05-07T20:32:41.9065936Z D: int, 2025-05-07T20:32:41.9066170Z scale_ub: Optional[float], 2025-05-07T20:32:41.9066455Z contiguous: bool, 2025-05-07T20:32:41.9066703Z compiled: bool, 2025-05-07T20:32:41.9066992Z ) -> None: 2025-05-07T20:32:41.9067220Z torch.manual_seed(2025) 2025-05-07T20:32:41.9067467Z 2025-05-07T20:32:41.9067753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9068104Z 2025-05-07T20:32:41.9068302Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9068606Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9070601Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9072452Z 2025-05-07T20:32:41.9072578Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:41.9072796Z 2025-05-07T20:32:41.9072911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9073327Z self=, 2025-05-07T20:32:41.9073741Z T=128, 2025-05-07T20:32:41.9073941Z D=7168, 2025-05-07T20:32:41.9074142Z scale_ub=None, 2025-05-07T20:32:41.9074369Z contiguous=True, 2025-05-07T20:32:41.9074604Z compiled=True, 2025-05-07T20:32:41.9074812Z ) 2025-05-07T20:32:42.1075182Z self = 2025-05-07T20:32:42.1075730Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1076005Z 2025-05-07T20:32:42.1076087Z @given( 2025-05-07T20:32:42.1076325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1076656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1077283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1077785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1078109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1078402Z ) 2025-05-07T20:32:42.1078756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1088035Z def test_silu_mul_quant( 2025-05-07T20:32:42.1088314Z self, 2025-05-07T20:32:42.1088521Z T: int, 2025-05-07T20:32:42.1088725Z D: int, 2025-05-07T20:32:42.1088941Z scale_ub: Optional[float], 2025-05-07T20:32:42.1089227Z contiguous: bool, 2025-05-07T20:32:42.1089473Z compiled: bool, 2025-05-07T20:32:42.1089702Z ) -> None: 2025-05-07T20:32:42.1089929Z torch.manual_seed(2025) 2025-05-07T20:32:42.1090179Z 2025-05-07T20:32:42.1090596Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1092673Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1094549Z 2025-05-07T20:32:42.1094671Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1094893Z 2025-05-07T20:32:42.1095465Z FAILED 2025-05-07T20:32:42.1095577Z 2025-05-07T20:32:42.1095715Z =================================== FAILURES =================================== 2025-05-07T20:32:42.1096144Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.1096668Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.1098489Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:42.1099122Z | yield 2025-05-07T20:32:42.1099683Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:42.1100261Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.1100938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:42.1101650Z | if method() is not None: 2025-05-07T20:32:42.1101912Z | ^^^^^^^^ 2025-05-07T20:32:42.1102675Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.1103589Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1104000Z | ^^^^^^^ 2025-05-07T20:32:42.1104804Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.1105675Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.1106253Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.1106847Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.1107253Z | Traceback (most recent call last): 2025-05-07T20:32:42.1108298Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1109391Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1109906Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1112362Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1115213Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1115825Z | self=, 2025-05-07T20:32:42.1116389Z | T=2048, 2025-05-07T20:32:42.1116717Z | D=5120, # or any other generated value 2025-05-07T20:32:42.1117628Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.1118173Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.1118732Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.1119169Z | ) 2025-05-07T20:32:42.1119428Z | 2025-05-07T20:32:42.1120138Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.1120985Z +---------------- 2 ---------------- 2025-05-07T20:32:42.1121384Z | Traceback (most recent call last): 2025-05-07T20:32:42.1122379Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1123460Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1124134Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1126916Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1129713Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1130311Z | self=, 2025-05-07T20:32:42.1130770Z | T=128, 2025-05-07T20:32:42.1130979Z | D=7168, 2025-05-07T20:32:42.1131191Z | scale_ub=None, 2025-05-07T20:32:42.1131485Z | contiguous=True, 2025-05-07T20:32:42.1131831Z | compiled=True, 2025-05-07T20:32:42.1132141Z | ) 2025-05-07T20:32:42.1132391Z | 2025-05-07T20:32:42.1133120Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1133934Z +---------------- 3 ---------------- 2025-05-07T20:32:42.1134225Z | Traceback (most recent call last): 2025-05-07T20:32:42.1134952Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.1135733Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1136115Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1138164Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
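Each falsifying example above comes with a replay payload, so any one of them can be re-run deterministically. A sketch using the payload Hypothesis printed for failure 1 (the decorator goes on top of the existing @given stack and should be removed after debugging):

    # Sketch: temporarily replay failure 1 from this log. The version string
    # and payload are copied verbatim from the Hypothesis output above.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...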
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1140412Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1140850Z | self=, 2025-05-07T20:32:42.1141260Z | T=128, 2025-05-07T20:32:42.1141467Z | D=5120, 2025-05-07T20:32:42.1141676Z | scale_ub=1200.0, 2025-05-07T20:32:42.1141920Z | contiguous=True, 2025-05-07T20:32:42.1142165Z | compiled=True, 2025-05-07T20:32:42.1142387Z | ) 2025-05-07T20:32:42.1142570Z | 2025-05-07T20:32:42.1143095Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1143694Z +---------------- 4 ---------------- 2025-05-07T20:32:42.1144083Z | Traceback (most recent call last): 2025-05-07T20:32:42.1144813Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.1145537Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1145822Z | ^^^^^^^^ 2025-05-07T20:32:42.1146468Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.1147169Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1147503Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1148308Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.1149111Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1149734Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.1150547Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1150995Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1151639Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.1152425Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1152899Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1153578Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.1154393Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1154866Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1155506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.1156222Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1156602Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1157201Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.1157772Z | fn() 2025-05-07T20:32:42.1158348Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.1158984Z | self.fn.run( 2025-05-07T20:32:42.1159587Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.1160176Z | kernel = self.compile( 2025-05-07T20:32:42.1160502Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:42.1161095Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.1161808Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1162198Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1162845Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.1163766Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1164245Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.1164714Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1165081Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1165341Z | ^ 2025-05-07T20:32:42.1165802Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1166390Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.1166788Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.1167307Z | self=, 2025-05-07T20:32:42.1167748Z | T=1, # or any other generated value 2025-05-07T20:32:42.1168055Z | D=5120, # or any other generated value 2025-05-07T20:32:42.1168395Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.1168766Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.1169125Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.1169516Z | ) 2025-05-07T20:32:42.1169762Z | 2025-05-07T20:32:42.1170564Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.1171406Z +------------------------------------ 2025-05-07T20:32:42.1171896Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.1172414Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1172990Z self=, 2025-05-07T20:32:42.1173563Z T=1, 2025-05-07T20:32:42.1173829Z D=5120, 2025-05-07T20:32:42.1174095Z scale_ub=None, 2025-05-07T20:32:42.1174407Z contiguous=True, 2025-05-07T20:32:42.1174724Z compiled=True, 2025-05-07T20:32:42.1175029Z ) 2025-05-07T20:32:42.1175481Z self = 2025-05-07T20:32:42.1176169Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1176538Z 2025-05-07T20:32:42.1176680Z @given( 2025-05-07T20:32:42.1177001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1177496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1177936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1178401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1178868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1179276Z ) 2025-05-07T20:32:42.1179766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1180407Z def test_silu_mul_quant( 2025-05-07T20:32:42.1180753Z self, 2025-05-07T20:32:42.1181012Z T: int, 2025-05-07T20:32:42.1181299Z D: int, 2025-05-07T20:32:42.1181618Z scale_ub: Optional[float], 2025-05-07T20:32:42.1181996Z contiguous: 
bool, 2025-05-07T20:32:42.1182343Z compiled: bool, 2025-05-07T20:32:42.1182664Z ) -> None: 2025-05-07T20:32:42.1183079Z torch.manual_seed(2025) 2025-05-07T20:32:42.1183479Z 2025-05-07T20:32:42.1183865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1184357Z 2025-05-07T20:32:42.1184625Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1185040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1185483Z x = x_sign * x_clamp 2025-05-07T20:32:42.1185816Z x0 = x[:, :D] 2025-05-07T20:32:42.1186133Z x1 = x[:, D:] 2025-05-07T20:32:42.1186438Z 2025-05-07T20:32:42.1186703Z if contiguous: 2025-05-07T20:32:42.1187039Z x0 = x0.contiguous() 2025-05-07T20:32:42.1187410Z x1 = x1.contiguous() 2025-05-07T20:32:42.1187744Z 2025-05-07T20:32:42.1188022Z if scale_ub is not None: 2025-05-07T20:32:42.1188415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1188959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1189392Z ) 2025-05-07T20:32:42.1189686Z else: 2025-05-07T20:32:42.1189994Z scale_ub_tensor = None 2025-05-07T20:32:42.1190345Z 2025-05-07T20:32:42.1190672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1191120Z op = silu_mul_quant 2025-05-07T20:32:42.1191468Z if compiled: 2025-05-07T20:32:42.1191820Z op = torch.compile(op) 2025-05-07T20:32:42.1192243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1192630Z 2025-05-07T20:32:42.1192903Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1193320Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1193731Z 2025-05-07T20:32:42.1194077Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1194540Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1194935Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1195380Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1195890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1196401Z 2025-05-07T20:32:42.1196677Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1196963Z 2025-05-07T20:32:42.1197108Z moe/activation_test.py:126: 2025-05-07T20:32:42.1197580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1198049Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1198516Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1199644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1200732Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1201504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1202487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1203440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1204600Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1205726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1206804Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1207839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1208742Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1209598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1210343Z fn() 2025-05-07T20:32:42.1211124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1212008Z self.fn.run( 2025-05-07T20:32:42.1212677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1213408Z kernel = self.compile( 2025-05-07T20:32:42.1214140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1215077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1215642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1215967Z 2025-05-07T20:32:42.1216261Z self = 2025-05-07T20:32:42.1217802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1219864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab33d3a0>} 2025-05-07T20:32:42.1221787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1223242Z context = 2025-05-07T20:32:42.1223652Z 2025-05-07T20:32:42.1223895Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1224625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1225291Z module_map=module_map) 2025-05-07T20:32:42.1225802Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1226354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1226743Z E ^ 2025-05-07T20:32:42.1227398Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1228034Z 2025-05-07T20:32:42.1228604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1229288Z 2025-05-07T20:32:42.1229422Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1229964Z self=, 2025-05-07T20:32:42.1230497Z T=2048, 2025-05-07T20:32:42.1230736Z D=5120, 2025-05-07T20:32:42.1231001Z scale_ub=1200.0, 2025-05-07T20:32:42.1231302Z contiguous=True, 2025-05-07T20:32:42.1231592Z compiled=False, 2025-05-07T20:32:42.1231869Z ) 2025-05-07T20:32:42.1232304Z self = 2025-05-07T20:32:42.1232992Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1233370Z 2025-05-07T20:32:42.1233472Z @given( 2025-05-07T20:32:42.1233787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1234216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1234620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1235079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1235531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1235904Z ) 2025-05-07T20:32:42.1236396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1237004Z def test_silu_mul_quant( 2025-05-07T20:32:42.1237327Z self, 2025-05-07T20:32:42.1237581Z T: int, 2025-05-07T20:32:42.1237841Z D: int, 2025-05-07T20:32:42.1238126Z scale_ub: Optional[float], 2025-05-07T20:32:42.1238726Z contiguous: bool, 2025-05-07T20:32:42.1239196Z compiled: bool, 2025-05-07T20:32:42.1239562Z ) -> None: 2025-05-07T20:32:42.1239839Z torch.manual_seed(2025) 2025-05-07T20:32:42.1240173Z 2025-05-07T20:32:42.1240551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1241021Z 2025-05-07T20:32:42.1241291Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1241697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1242126Z x = x_sign * x_clamp 2025-05-07T20:32:42.1242463Z x0 = x[:, :D] 2025-05-07T20:32:42.1242775Z x1 = x[:, D:] 2025-05-07T20:32:42.1243064Z 2025-05-07T20:32:42.1243328Z if contiguous: 2025-05-07T20:32:42.1243755Z x0 = x0.contiguous() 2025-05-07T20:32:42.1244116Z x1 = x1.contiguous() 2025-05-07T20:32:42.1244455Z 2025-05-07T20:32:42.1244734Z if scale_ub is not None: 2025-05-07T20:32:42.1245205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1245699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1246142Z ) 2025-05-07T20:32:42.1246415Z else: 2025-05-07T20:32:42.1246701Z scale_ub_tensor = None 2025-05-07T20:32:42.1247054Z 2025-05-07T20:32:42.1247381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1247817Z op = silu_mul_quant 2025-05-07T20:32:42.1248173Z if compiled: 2025-05-07T20:32:42.1248523Z op = torch.compile(op) 2025-05-07T20:32:42.1248927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1249313Z 2025-05-07T20:32:42.1249583Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1249805Z 2025-05-07T20:32:42.1249937Z moe/activation_test.py:117: 2025-05-07T20:32:42.1250341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1250799Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1251183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1252229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1253208Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1253945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1254880Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1255795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1256543Z kernel = self.compile( 2025-05-07T20:32:42.1257313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1258240Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1258802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1259131Z 2025-05-07T20:32:42.1259427Z self = 2025-05-07T20:32:42.1260968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1262912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3ab1ec2c0>} 2025-05-07T20:32:42.1264690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1266040Z context = 2025-05-07T20:32:42.1266417Z 2025-05-07T20:32:42.1266697Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1267482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1268090Z module_map=module_map) 2025-05-07T20:32:42.1268582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1269099Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1269448Z E ^ 2025-05-07T20:32:42.1270087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1270713Z 2025-05-07T20:32:42.1271290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1272001Z 2025-05-07T20:32:42.1272149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1272750Z self=, 2025-05-07T20:32:42.1273306Z T=2048, 2025-05-07T20:32:42.1273577Z D=5120, 2025-05-07T20:32:42.1273835Z scale_ub=1200.0, 2025-05-07T20:32:42.1274149Z contiguous=True, 2025-05-07T20:32:42.1274465Z compiled=True, 2025-05-07T20:32:42.1274745Z ) 2025-05-07T20:32:42.1275194Z self = 2025-05-07T20:32:42.1298361Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1298922Z 2025-05-07T20:32:42.1299037Z @given( 2025-05-07T20:32:42.1299342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1299754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1300156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1300584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1301013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1301392Z ) 2025-05-07T20:32:42.1301869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1302632Z def test_silu_mul_quant( 2025-05-07T20:32:42.1302942Z self, 2025-05-07T20:32:42.1303188Z T: int, 2025-05-07T20:32:42.1303443Z D: int, 2025-05-07T20:32:42.1303734Z scale_ub: Optional[float], 2025-05-07T20:32:42.1304104Z contiguous: bool, 2025-05-07T20:32:42.1304424Z compiled: bool, 2025-05-07T20:32:42.1304713Z ) -> None: 2025-05-07T20:32:42.1305009Z torch.manual_seed(2025) 2025-05-07T20:32:42.1305354Z 2025-05-07T20:32:42.1305736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1306218Z 2025-05-07T20:32:42.1306476Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1306877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1307315Z x = x_sign * x_clamp 2025-05-07T20:32:42.1307574Z x0 = x[:, :D] 2025-05-07T20:32:42.1307796Z x1 = x[:, D:] 2025-05-07T20:32:42.1308005Z 2025-05-07T20:32:42.1308186Z if contiguous: 2025-05-07T20:32:42.1308433Z x0 = x0.contiguous() 2025-05-07T20:32:42.1308690Z x1 = x1.contiguous() 2025-05-07T20:32:42.1308917Z 2025-05-07T20:32:42.1309107Z if scale_ub is not None: 2025-05-07T20:32:42.1309373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1309701Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1310019Z ) 2025-05-07T20:32:42.1310215Z else: 2025-05-07T20:32:42.1310425Z scale_ub_tensor = None 2025-05-07T20:32:42.1310671Z 2025-05-07T20:32:42.1310895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1311203Z op = silu_mul_quant 2025-05-07T20:32:42.1311439Z if compiled: 2025-05-07T20:32:42.1311673Z op = torch.compile(op) 2025-05-07T20:32:42.1311966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1312236Z 2025-05-07T20:32:42.1312493Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1312774Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1313110Z 2025-05-07T20:32:42.1313344Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1313673Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1313957Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1314270Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1314624Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1314923Z 2025-05-07T20:32:42.1315121Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb3aa0eb880>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb3a9e23380>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
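Root cause: fp8e4nv is Triton's name for the float8 e4m3 format, which its NVIDIA backend only lowers on compute capability 8.9 (Ada) and newer; on older parts such as the A10G (sm_86) only the fp8e4b15 and fp8e5 encodings exist, so every kernel that touches an e4m3 tensor fails at compile time regardless of the example parameters. A minimal sketch of a capability guard such a test could use (the helper and marker names here are illustrative, not FBGEMM's actual API):

import pytest
import torch

def fp8_e4m3_supported() -> bool:
    # Triton lowers fp8e4nv (e4m3) only on compute capability 8.9+.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker; applying it to test_silu_mul_quant would turn
# these compile-time failures into skips on pre-sm_89 GPUs.
requires_fp8e4m3 = pytest.mark.skipif(
    not fp8_e4m3_supported(), reason="fp8e4nv (e4m3) requires sm_89 or newer"
)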
The remaining Hypothesis examples fail identically; the test body and tracebacks are the same as above except for the sampled parameters and which wrapper reaches the Triton compiler first. In every example here, compiled=False fails inside fn() while compiling _fbgemm_silu_mul_quant, whereas with compiled=True fn() completes and ref_fn() then fails while compiling _kernel_quantize_fp8_row. Each example ends with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
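Note that ref_fn() is not an independent oracle here: it routes through triton_quantize_fp8_row and therefore needs the same fp8-capable hardware as the op under test. A rough pure-PyTorch stand-in for row-wise e4m3 quantization, written against the dequantization the test uses (y = y_fp8.to(torch.float32) * y_scale[:, None]); this is a sketch under that assumption, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale chosen so each row fills the e4m3 range.
    row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
    inv_scale = FP8_E4M3_MAX / row_max
    x_fp8 = (
        (x.to(torch.float32) * inv_scale)
        .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
        .to(torch.float8_e4m3fn)
    )
    # Returned scale is the dequant multiplier: x ~= x_fp8.float() * scale[:, None]
    return x_fp8, (row_max / FP8_E4M3_MAX).squeeze(-1)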
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() failed compiling _fbgemm_silu_mul_quant
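The error message itself names the formats this architecture does support ('fp8e4b15', 'fp8e5'), which suggests an alternative to skipping: fall back to e5m2 on pre-sm_89 GPUs at the cost of mantissa precision. A hypothetical selector, assuming the kernels could be parameterized by output dtype:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 ("fp8e4nv") needs sm_89+; e5m2 ("fp8e5") is available on older parts.
    major, minor = torch.cuda.get_device_capability()
    return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2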
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> ref_fn() failed compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
= None 2025-05-07T20:32:42.1553779Z 2025-05-07T20:32:42.1553964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1554098Z op = silu_mul_quant 2025-05-07T20:32:42.1554188Z if compiled: 2025-05-07T20:32:42.1554284Z op = torch.compile(op) 2025-05-07T20:32:42.1554386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1554463Z 2025-05-07T20:32:42.1554551Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1554670Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1554747Z 2025-05-07T20:32:42.1554880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1554980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1555083Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1555204Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1555345Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1555465Z 2025-05-07T20:32:42.1555566Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1555576Z 2025-05-07T20:32:42.1555680Z moe/activation_test.py:126: 2025-05-07T20:32:42.1555809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1555913Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1556054Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1556615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1556716Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1557087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1557312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1557692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1557950Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1558399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1558665Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1559045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1559217Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1559559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1559636Z fn() 2025-05-07T20:32:42.1560048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1560131Z self.fn.run( 2025-05-07T20:32:42.1560476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1560585Z kernel = self.compile( 2025-05-07T20:32:42.1560973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1561156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1561282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1561286Z 2025-05-07T20:32:42.1561493Z self = 2025-05-07T20:32:42.1562270Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1562817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a887e700>} 2025-05-07T20:32:42.1563729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1563923Z context = 2025-05-07T20:32:42.1563928Z 2025-05-07T20:32:42.1564097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1564360Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1564470Z module_map=module_map) 2025-05-07T20:32:42.1564638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1564738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1564857Z E ^ 2025-05-07T20:32:42.1565225Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1565234Z 2025-05-07T20:32:42.1565653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1565657Z 2025-05-07T20:32:42.1565768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1565989Z self=, 2025-05-07T20:32:42.1566065Z T=16384, 2025-05-07T20:32:42.1566148Z D=5120, 2025-05-07T20:32:42.1566229Z scale_ub=None, 2025-05-07T20:32:42.1566312Z contiguous=True, 2025-05-07T20:32:42.1566400Z compiled=True, 2025-05-07T20:32:42.1566471Z ) 2025-05-07T20:32:42.1566689Z self = 2025-05-07T20:32:42.1566870Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.1566877Z 2025-05-07T20:32:42.1566953Z @given( 2025-05-07T20:32:42.1567085Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1567226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1567340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1567463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1567576Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1567650Z ) 2025-05-07T20:32:42.1567904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1568000Z def test_silu_mul_quant( 2025-05-07T20:32:42.1568085Z self, 2025-05-07T20:32:42.1568163Z T: int, 2025-05-07T20:32:42.1568242Z D: int, 2025-05-07T20:32:42.1568341Z scale_ub: Optional[float], 2025-05-07T20:32:42.1568432Z contiguous: bool, 2025-05-07T20:32:42.1568516Z compiled: bool, 2025-05-07T20:32:42.1568600Z ) -> None: 2025-05-07T20:32:42.1568706Z torch.manual_seed(2025) 2025-05-07T20:32:42.1568779Z 2025-05-07T20:32:42.1568960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1569038Z 2025-05-07T20:32:42.1569126Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1569260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1569347Z x = x_sign * x_clamp 2025-05-07T20:32:42.1569425Z x0 = x[:, :D] 2025-05-07T20:32:42.1569508Z x1 = x[:, D:] 2025-05-07T20:32:42.1569583Z 2025-05-07T20:32:42.1569671Z if contiguous: 2025-05-07T20:32:42.1569758Z x0 = x0.contiguous() 2025-05-07T20:32:42.1569847Z x1 = x1.contiguous() 2025-05-07T20:32:42.1569930Z 2025-05-07T20:32:42.1570019Z if scale_ub is not None: 2025-05-07T20:32:42.1570123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1570259Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:42.1570341Z ) 2025-05-07T20:32:42.1570417Z else: 2025-05-07T20:32:42.1570562Z scale_ub_tensor = None 2025-05-07T20:32:42.1570710Z 2025-05-07T20:32:42.1570838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1570931Z op = silu_mul_quant 2025-05-07T20:32:42.1571014Z if compiled: 2025-05-07T20:32:42.1571117Z op = torch.compile(op) 2025-05-07T20:32:42.1571220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1571291Z 2025-05-07T20:32:42.1571387Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1571506Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1571574Z 2025-05-07T20:32:42.1571713Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1571815Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1571912Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1572040Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1572222Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1572303Z 2025-05-07T20:32:42.1572411Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.1572416Z 2025-05-07T20:32:42.1572513Z moe/activation_test.py:126: 2025-05-07T20:32:42.1572647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1572754Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1572890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1573461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1573560Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1573922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1574151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1574523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1574829Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1575228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1575482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1575863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1576028Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1576379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1576456Z fn() 2025-05-07T20:32:42.1576865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1576960Z self.fn.run( 2025-05-07T20:32:42.1577305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1577400Z kernel = self.compile( 2025-05-07T20:32:42.1577796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1577972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1578108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.1578112Z 2025-05-07T20:32:42.1578317Z self = 2025-05-07T20:32:42.1579092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1579643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8415d00>} 2025-05-07T20:32:42.1580436Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1580640Z context = 2025-05-07T20:32:42.1580645Z 2025-05-07T20:32:42.1580810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1592988Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1593125Z module_map=module_map) 2025-05-07T20:32:42.1593377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1593496Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1593574Z E ^ 2025-05-07T20:32:42.1593950Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1593956Z 2025-05-07T20:32:42.1594380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1594385Z 2025-05-07T20:32:42.1594488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1594722Z self=, 2025-05-07T20:32:42.1594801Z T=1, 2025-05-07T20:32:42.1594886Z D=5120, 2025-05-07T20:32:42.1594970Z scale_ub=1200.0, 2025-05-07T20:32:42.1595055Z contiguous=True, 2025-05-07T20:32:42.1595147Z compiled=True, 2025-05-07T20:32:42.1595220Z ) 2025-05-07T20:32:42.1595440Z self = 2025-05-07T20:32:42.1595624Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1595677Z 2025-05-07T20:32:42.1595760Z @given( 2025-05-07T20:32:42.1595882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1595991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1596107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1596233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1596349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1596424Z ) 2025-05-07T20:32:42.1596680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1596777Z def test_silu_mul_quant( 2025-05-07T20:32:42.1596854Z self, 2025-05-07T20:32:42.1596944Z T: int, 2025-05-07T20:32:42.1597022Z D: int, 2025-05-07T20:32:42.1597123Z scale_ub: Optional[float], 2025-05-07T20:32:42.1597223Z contiguous: bool, 2025-05-07T20:32:42.1597323Z compiled: bool, 2025-05-07T20:32:42.1597421Z ) -> None: 2025-05-07T20:32:42.1597549Z torch.manual_seed(2025) 2025-05-07T20:32:42.1597630Z 2025-05-07T20:32:42.1597813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1597891Z 2025-05-07T20:32:42.1597985Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1598119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1598210Z x = x_sign * x_clamp 2025-05-07T20:32:42.1598293Z x0 = x[:, :D] 2025-05-07T20:32:42.1598380Z x1 = x[:, D:] 2025-05-07T20:32:42.1598453Z 2025-05-07T20:32:42.1598543Z if contiguous: 2025-05-07T20:32:42.1598645Z x0 = x0.contiguous() 2025-05-07T20:32:42.1598735Z x1 = x1.contiguous() 2025-05-07T20:32:42.1598810Z 2025-05-07T20:32:42.1598916Z if scale_ub is not None: 2025-05-07T20:32:42.1599024Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:42.1599162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1599299Z ) 2025-05-07T20:32:42.1599421Z else: 2025-05-07T20:32:42.1599527Z scale_ub_tensor = None 2025-05-07T20:32:42.1599599Z 2025-05-07T20:32:42.1599730Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1599828Z op = silu_mul_quant 2025-05-07T20:32:42.1599911Z if compiled: 2025-05-07T20:32:42.1600012Z op = torch.compile(op) 2025-05-07T20:32:42.1600127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1600201Z 2025-05-07T20:32:42.1600294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1600298Z 2025-05-07T20:32:42.1600408Z moe/activation_test.py:117: 2025-05-07T20:32:42.1600539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1600649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1600748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1601170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1601283Z return fn(*args, **kwargs) 2025-05-07T20:32:42.1601786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1601885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1602256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1602482Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1602835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1602930Z kernel = self.compile( 2025-05-07T20:32:42.1603316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1603510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1603808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1603815Z 2025-05-07T20:32:42.1604032Z self = 2025-05-07T20:32:42.1604804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1605307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8670ae0>} 2025-05-07T20:32:42.1606067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1606258Z context = 2025-05-07T20:32:42.1606266Z 2025-05-07T20:32:42.1606440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1606704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1606812Z module_map=module_map) 2025-05-07T20:32:42.1606981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1607079Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1607154Z E ^ 2025-05-07T20:32:42.1607518Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1607523Z 2025-05-07T20:32:42.1607939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1607944Z 2025-05-07T20:32:42.1608054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1608329Z self=, 2025-05-07T20:32:42.1608443Z T=1, 2025-05-07T20:32:42.1608527Z D=5120, 2025-05-07T20:32:42.1608609Z scale_ub=None, 2025-05-07T20:32:42.1608703Z contiguous=False, 2025-05-07T20:32:42.1608786Z compiled=True, 2025-05-07T20:32:42.1608860Z ) 2025-05-07T20:32:42.1609084Z self = 2025-05-07T20:32:42.1609249Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1609254Z 2025-05-07T20:32:42.1609331Z @given( 2025-05-07T20:32:42.1609461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1609563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1609676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1609803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1609956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1610048Z ) 2025-05-07T20:32:42.1610295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1610390Z def test_silu_mul_quant( 2025-05-07T20:32:42.1610473Z self, 2025-05-07T20:32:42.1610549Z T: int, 2025-05-07T20:32:42.1610625Z D: int, 2025-05-07T20:32:42.1610733Z scale_ub: Optional[float], 2025-05-07T20:32:42.1610822Z contiguous: bool, 2025-05-07T20:32:42.1610906Z compiled: bool, 2025-05-07T20:32:42.1610996Z ) -> None: 2025-05-07T20:32:42.1611091Z torch.manual_seed(2025) 2025-05-07T20:32:42.1611165Z 2025-05-07T20:32:42.1611346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1611421Z 2025-05-07T20:32:42.1611522Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1611646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1611738Z x = x_sign * x_clamp 2025-05-07T20:32:42.1611828Z x0 = x[:, :D] 2025-05-07T20:32:42.1611953Z x1 = x[:, D:] 2025-05-07T20:32:42.1612030Z 2025-05-07T20:32:42.1612124Z if contiguous: 2025-05-07T20:32:42.1612215Z x0 = x0.contiguous() 2025-05-07T20:32:42.1612304Z x1 = x1.contiguous() 2025-05-07T20:32:42.1612386Z 2025-05-07T20:32:42.1612477Z if scale_ub is not None: 2025-05-07T20:32:42.1612585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1612727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1612805Z ) 2025-05-07T20:32:42.1612881Z else: 2025-05-07T20:32:42.1612987Z scale_ub_tensor = None 2025-05-07T20:32:42.1613061Z 2025-05-07T20:32:42.1613199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1613289Z op = silu_mul_quant 2025-05-07T20:32:42.1613375Z if compiled: 2025-05-07T20:32:42.1613488Z op = torch.compile(op) 2025-05-07T20:32:42.1613596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1613675Z 2025-05-07T20:32:42.1613775Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.1613900Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.1613974Z 2025-05-07T20:32:42.1614122Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1614225Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.1614326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.1614457Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.1614597Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1614678Z 2025-05-07T20:32:42.1614778Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.1614782Z 2025-05-07T20:32:42.1614883Z moe/activation_test.py:126: 2025-05-07T20:32:42.1615011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1615196Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.1615333Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.1615941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.1616044Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.1616407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1616639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1617007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.1617262Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1617737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.1618022Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.1618410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.1618577Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.1618920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.1619001Z fn() 2025-05-07T20:32:42.1619400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.1619490Z self.fn.run( 2025-05-07T20:32:42.1619830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1619922Z kernel = self.compile( 2025-05-07T20:32:42.1620318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1620538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1620667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1620677Z 2025-05-07T20:32:42.1620882Z self = 2025-05-07T20:32:42.1621653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1622161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fb3a8699e40>} 2025-05-07T20:32:42.1622911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1623109Z context = 2025-05-07T20:32:42.1623114Z 2025-05-07T20:32:42.1623274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1623537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1623647Z module_map=module_map) 2025-05-07T20:32:42.1623810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1623909Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.1623993Z E ^ 2025-05-07T20:32:42.1624350Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1624354Z 2025-05-07T20:32:42.1624822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1624830Z 2025-05-07T20:32:42.1624972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1625191Z self=, 2025-05-07T20:32:42.1625274Z T=1, 2025-05-07T20:32:42.1625347Z D=5120, 2025-05-07T20:32:42.1625433Z scale_ub=None, 2025-05-07T20:32:42.1625516Z contiguous=True, 2025-05-07T20:32:42.1625597Z compiled=False, 2025-05-07T20:32:42.1625675Z ) 2025-05-07T20:32:42.1625889Z self = 2025-05-07T20:32:42.1626050Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1626054Z 2025-05-07T20:32:42.1626136Z @given( 2025-05-07T20:32:42.1626253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1626348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1626513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1626631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1626752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1626825Z ) 2025-05-07T20:32:42.1627068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1627163Z def test_silu_mul_quant( 2025-05-07T20:32:42.1627239Z self, 2025-05-07T20:32:42.1627312Z T: int, 2025-05-07T20:32:42.1627391Z D: int, 2025-05-07T20:32:42.1627486Z scale_ub: Optional[float], 2025-05-07T20:32:42.1627571Z contiguous: bool, 2025-05-07T20:32:42.1627659Z compiled: bool, 2025-05-07T20:32:42.1627737Z ) -> None: 2025-05-07T20:32:42.1627829Z torch.manual_seed(2025) 2025-05-07T20:32:42.1627908Z 2025-05-07T20:32:42.1628078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1628153Z 2025-05-07T20:32:42.1628246Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1628372Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1628513Z x = x_sign * x_clamp 2025-05-07T20:32:42.1628596Z x0 = x[:, :D] 2025-05-07T20:32:42.1628678Z x1 = x[:, D:] 2025-05-07T20:32:42.1628756Z 2025-05-07T20:32:42.1628838Z if contiguous: 2025-05-07T20:32:42.1628926Z x0 = x0.contiguous() 2025-05-07T20:32:42.1629019Z x1 = x1.contiguous() 2025-05-07T20:32:42.1629091Z 2025-05-07T20:32:42.1629179Z if scale_ub is not None: 2025-05-07T20:32:42.1629292Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1629426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1629501Z ) 2025-05-07T20:32:42.1629585Z else: 2025-05-07T20:32:42.1629676Z scale_ub_tensor = None 2025-05-07T20:32:42.1629751Z 2025-05-07T20:32:42.1629882Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1629974Z op = silu_mul_quant 2025-05-07T20:32:42.1630071Z if compiled: 2025-05-07T20:32:42.1630173Z 
op = torch.compile(op) 2025-05-07T20:32:42.1630280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1630358Z 2025-05-07T20:32:42.1630447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1630451Z 2025-05-07T20:32:42.1630550Z moe/activation_test.py:117: 2025-05-07T20:32:42.1630683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1630783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1630891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1631394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1631492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1631860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1632129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1632575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1632673Z kernel = self.compile( 2025-05-07T20:32:42.1633058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1633239Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1633365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1633369Z 2025-05-07T20:32:42.1633574Z self = 2025-05-07T20:32:42.1634389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1634894Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a869b740>} 2025-05-07T20:32:42.1635659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1635847Z context = 2025-05-07T20:32:42.1635851Z 2025-05-07T20:32:42.1636022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1636284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1636388Z module_map=module_map) 2025-05-07T20:32:42.1636555Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1636654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1636732Z E ^ 2025-05-07T20:32:42.1637138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1637147Z 2025-05-07T20:32:42.1637562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1637566Z 2025-05-07T20:32:42.1637673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1637892Z self=, 2025-05-07T20:32:42.1637966Z T=128, 2025-05-07T20:32:42.1638046Z D=5120, 2025-05-07T20:32:42.1638126Z scale_ub=None, 2025-05-07T20:32:42.1638209Z contiguous=False, 2025-05-07T20:32:42.1638297Z compiled=True, 2025-05-07T20:32:42.1638372Z ) 2025-05-07T20:32:42.1639110Z self = 2025-05-07T20:32:42.1639295Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1639305Z 2025-05-07T20:32:42.1639382Z @given( 2025-05-07T20:32:42.1639510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1639606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1639719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1639841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1639953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1640027Z ) 2025-05-07T20:32:42.1640280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1640373Z def test_silu_mul_quant( 2025-05-07T20:32:42.1640450Z self, 2025-05-07T20:32:42.1640532Z T: int, 2025-05-07T20:32:42.1640605Z D: int, 2025-05-07T20:32:42.1640711Z scale_ub: Optional[float], 2025-05-07T20:32:42.1640797Z contiguous: bool, 2025-05-07T20:32:42.1640883Z compiled: bool, 2025-05-07T20:32:42.1640965Z ) -> None: 2025-05-07T20:32:42.1641232Z torch.manual_seed(2025) 2025-05-07T20:32:42.1641369Z 2025-05-07T20:32:42.1641546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1641616Z 2025-05-07T20:32:42.1641706Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1641837Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1641923Z x = x_sign * x_clamp 2025-05-07T20:32:42.1642001Z x0 = x[:, :D] 2025-05-07T20:32:42.1642084Z x1 = x[:, D:] 2025-05-07T20:32:42.1642157Z 2025-05-07T20:32:42.1642244Z if contiguous: 2025-05-07T20:32:42.1642334Z x0 = x0.contiguous() 2025-05-07T20:32:42.1642420Z x1 = x1.contiguous() 2025-05-07T20:32:42.1642502Z 2025-05-07T20:32:42.1642592Z if scale_ub is not None: 2025-05-07T20:32:42.1642695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1642907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1642989Z ) 2025-05-07T20:32:42.1643065Z else: 2025-05-07T20:32:42.1643169Z scale_ub_tensor = None 2025-05-07T20:32:42.1643241Z 2025-05-07T20:32:42.1643371Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1643469Z op = silu_mul_quant 2025-05-07T20:32:42.1643658Z if compiled: 2025-05-07T20:32:42.1643764Z op = torch.compile(op) 2025-05-07T20:32:42.1643868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1643938Z 2025-05-07T20:32:42.1644033Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1644037Z 2025-05-07T20:32:42.1644132Z moe/activation_test.py:117: 2025-05-07T20:32:42.1644263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1644369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1644468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1644845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1645044Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1645540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1645642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1646002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1646223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1646573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1646664Z kernel = self.compile( 2025-05-07T20:32:42.1647056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1647232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1647362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1647368Z 2025-05-07T20:32:42.1647579Z self = 2025-05-07T20:32:42.1648346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1648850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287d09120>} 2025-05-07T20:32:42.1649603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1649837Z context = 2025-05-07T20:32:42.1649881Z 2025-05-07T20:32:42.1650055Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1650320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1650431Z module_map=module_map) 2025-05-07T20:32:42.1650590Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1650686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1650768Z E ^ 2025-05-07T20:32:42.1651123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1651127Z 2025-05-07T20:32:42.1651542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1651553Z 2025-05-07T20:32:42.1651696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1651922Z self=, 2025-05-07T20:32:42.1652012Z T=128, 2025-05-07T20:32:42.1652088Z D=7168, 2025-05-07T20:32:42.1652170Z scale_ub=1200.0, 2025-05-07T20:32:42.1652262Z contiguous=False, 2025-05-07T20:32:42.1652344Z compiled=False, 2025-05-07T20:32:42.1652416Z ) 2025-05-07T20:32:42.1652638Z self = 2025-05-07T20:32:42.1652809Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1652814Z 2025-05-07T20:32:42.1652893Z @given( 2025-05-07T20:32:42.1653008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1653104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1653223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1653336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1653448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1653534Z ) 2025-05-07T20:32:42.1653824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1653918Z def test_silu_mul_quant( 2025-05-07T20:32:42.1653999Z self, 2025-05-07T20:32:42.1654072Z T: int, 2025-05-07T20:32:42.1654149Z D: int, 2025-05-07T20:32:42.1654263Z scale_ub: Optional[float], 2025-05-07T20:32:42.1654350Z contiguous: bool, 2025-05-07T20:32:42.1654435Z compiled: bool, 2025-05-07T20:32:42.1654518Z ) -> None: 2025-05-07T20:32:42.1654613Z torch.manual_seed(2025) 2025-05-07T20:32:42.1654684Z 2025-05-07T20:32:42.1654867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1654942Z 2025-05-07T20:32:42.1655042Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1655165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1655254Z x = x_sign * x_clamp 2025-05-07T20:32:42.1655340Z x0 = x[:, :D] 2025-05-07T20:32:42.1655426Z x1 = x[:, D:] 2025-05-07T20:32:42.1655506Z 2025-05-07T20:32:42.1655593Z if contiguous: 2025-05-07T20:32:42.1655682Z x0 = x0.contiguous() 2025-05-07T20:32:42.1655767Z x1 = x1.contiguous() 2025-05-07T20:32:42.1655847Z 2025-05-07T20:32:42.1655936Z if scale_ub is not None: 2025-05-07T20:32:42.1656036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1656177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1656250Z ) 2025-05-07T20:32:42.1656326Z else: 2025-05-07T20:32:42.1656426Z scale_ub_tensor = None 2025-05-07T20:32:42.1656499Z 2025-05-07T20:32:42.1656635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1656724Z op = silu_mul_quant 2025-05-07T20:32:42.1656807Z if compiled: 2025-05-07T20:32:42.1656915Z op = torch.compile(op) 2025-05-07T20:32:42.1657074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1657152Z 2025-05-07T20:32:42.1657294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1657298Z 2025-05-07T20:32:42.1657394Z moe/activation_test.py:117: 2025-05-07T20:32:42.1657523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1657628Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1657726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1658232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1658329Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1658691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1658922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1659307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1659406Z kernel = self.compile( 2025-05-07T20:32:42.1659800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1659974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1660110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1660115Z 2025-05-07T20:32:42.1660321Z self = 2025-05-07T20:32:42.1661090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1661602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287d08360>} 2025-05-07T20:32:42.1662399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1662597Z context = 2025-05-07T20:32:42.1662601Z 2025-05-07T20:32:42.1662766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1663039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1663148Z module_map=module_map) 2025-05-07T20:32:42.1663309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1663414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1663491Z E ^ 2025-05-07T20:32:42.1663850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1663860Z 2025-05-07T20:32:42.1664282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1664287Z 2025-05-07T20:32:42.1664388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1664617Z self=, 2025-05-07T20:32:42.1664695Z T=128, 2025-05-07T20:32:42.1664771Z D=5120, 2025-05-07T20:32:42.1664860Z scale_ub=None, 2025-05-07T20:32:42.1664946Z contiguous=False, 2025-05-07T20:32:42.1665030Z compiled=False, 2025-05-07T20:32:42.1665108Z ) 2025-05-07T20:32:42.1665327Z self = 2025-05-07T20:32:42.1665495Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.1665510Z 2025-05-07T20:32:42.1665588Z @given( 2025-05-07T20:32:42.1665752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1665862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1666018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1666132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1666252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1666324Z ) 2025-05-07T20:32:42.1666568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1666665Z def test_silu_mul_quant( 2025-05-07T20:32:42.1666740Z self, 2025-05-07T20:32:42.1666815Z T: int, 2025-05-07T20:32:42.1666902Z D: int, 2025-05-07T20:32:42.1666996Z scale_ub: Optional[float], 2025-05-07T20:32:42.1667093Z contiguous: bool, 2025-05-07T20:32:42.1667179Z compiled: bool, 2025-05-07T20:32:42.1667255Z ) -> None: 2025-05-07T20:32:42.1667355Z torch.manual_seed(2025) 2025-05-07T20:32:42.1667472Z 2025-05-07T20:32:42.1667645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1667736Z 2025-05-07T20:32:42.1667827Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1667952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1668047Z x = x_sign * x_clamp 2025-05-07T20:32:42.1668125Z x0 = x[:, :D] 2025-05-07T20:32:42.1668203Z x1 = x[:, D:] 2025-05-07T20:32:42.1668285Z 2025-05-07T20:32:42.1668369Z if contiguous: 2025-05-07T20:32:42.1668469Z x0 = x0.contiguous() 2025-05-07T20:32:42.1668564Z x1 = x1.contiguous() 2025-05-07T20:32:42.1668638Z 2025-05-07T20:32:42.1668735Z if scale_ub is not None: 2025-05-07T20:32:42.1668841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1668975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1669057Z ) 2025-05-07T20:32:42.1669133Z else: 2025-05-07T20:32:42.1669230Z scale_ub_tensor = None 2025-05-07T20:32:42.1669314Z 2025-05-07T20:32:42.1669488Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1669577Z op = silu_mul_quant 2025-05-07T20:32:42.1669666Z if compiled: 2025-05-07T20:32:42.1669764Z op = torch.compile(op) 2025-05-07T20:32:42.1669876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1669947Z 2025-05-07T20:32:42.1670036Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1670041Z 2025-05-07T20:32:42.1670144Z moe/activation_test.py:117: 2025-05-07T20:32:42.1670271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1670371Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1670475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1670978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1671071Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1671444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1671669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1672020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1672113Z kernel = self.compile( 2025-05-07T20:32:42.1672502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1672683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1672808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1672813Z 2025-05-07T20:32:42.1673024Z self = 2025-05-07T20:32:42.1673840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1674376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785c720>} 2025-05-07T20:32:42.1675137Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1675325Z context = 2025-05-07T20:32:42.1675330Z 2025-05-07T20:32:42.1675501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1675822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1675933Z module_map=module_map) 2025-05-07T20:32:42.1676104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1676200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1676281Z E ^ 2025-05-07T20:32:42.1676635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1676639Z 2025-05-07T20:32:42.1677053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1677057Z 2025-05-07T20:32:42.1677167Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1677386Z self=, 2025-05-07T20:32:42.1677466Z T=128, 2025-05-07T20:32:42.1677542Z D=5120, 2025-05-07T20:32:42.1677621Z scale_ub=1200.0, 2025-05-07T20:32:42.1677710Z contiguous=True, 2025-05-07T20:32:42.1677795Z compiled=False, 2025-05-07T20:32:42.1677866Z ) 2025-05-07T20:32:42.1678136Z self = 2025-05-07T20:32:42.1678306Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1678311Z 2025-05-07T20:32:42.1678384Z @given( 2025-05-07T20:32:42.1678507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1678604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1678724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1678839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1678949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1679029Z ) 2025-05-07T20:32:42.1679273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1679365Z def test_silu_mul_quant( 2025-05-07T20:32:42.1679446Z self, 2025-05-07T20:32:42.1679524Z T: int, 2025-05-07T20:32:42.1679598Z D: int, 2025-05-07T20:32:42.1679705Z scale_ub: Optional[float], 2025-05-07T20:32:42.1679796Z contiguous: bool, 2025-05-07T20:32:42.1679881Z compiled: bool, 2025-05-07T20:32:42.1679962Z ) -> None: 2025-05-07T20:32:42.1680053Z torch.manual_seed(2025) 2025-05-07T20:32:42.1680121Z 2025-05-07T20:32:42.1680295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1680366Z 2025-05-07T20:32:42.1680460Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1680580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1680667Z x = x_sign * x_clamp 2025-05-07T20:32:42.1680751Z x0 = x[:, :D] 2025-05-07T20:32:42.1680830Z x1 = x[:, D:] 2025-05-07T20:32:42.1680899Z 2025-05-07T20:32:42.1680985Z if contiguous: 2025-05-07T20:32:42.1681073Z x0 = x0.contiguous() 2025-05-07T20:32:42.1681158Z x1 = x1.contiguous() 2025-05-07T20:32:42.1681241Z 2025-05-07T20:32:42.1681379Z if scale_ub is not None: 2025-05-07T20:32:42.1681490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1681667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1681742Z ) 2025-05-07T20:32:42.1681823Z else: 2025-05-07T20:32:42.1681914Z scale_ub_tensor = None 2025-05-07T20:32:42.1681986Z 2025-05-07T20:32:42.1682121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1682209Z op = silu_mul_quant 2025-05-07T20:32:42.1682298Z if compiled: 2025-05-07T20:32:42.1682401Z op = torch.compile(op) 2025-05-07T20:32:42.1682503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1682575Z 2025-05-07T20:32:42.1682670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1682675Z 2025-05-07T20:32:42.1682772Z moe/activation_test.py:117: 2025-05-07T20:32:42.1682944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1683048Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1683151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1683749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1683847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1684208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1684433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1684775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1684878Z kernel = self.compile( 2025-05-07T20:32:42.1685262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1685441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1685624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1685632Z 2025-05-07T20:32:42.1685836Z self = 2025-05-07T20:32:42.1686613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1687110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785d8a0>} 2025-05-07T20:32:42.1687865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1688068Z context = 2025-05-07T20:32:42.1688079Z 2025-05-07T20:32:42.1688245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1688514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1688619Z module_map=module_map) 2025-05-07T20:32:42.1688779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1688885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1688960Z E ^ 2025-05-07T20:32:42.1689315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1689327Z 
2025-05-07T20:32:42.1689741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.1689746Z 
2025-05-07T20:32:42.1689850Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1690121Z     self=,
2025-05-07T20:32:42.1690236Z     T=1,
2025-05-07T20:32:42.1690313Z     D=7168,
2025-05-07T20:32:42.1690405Z     scale_ub=1200.0,
2025-05-07T20:32:42.1690487Z     contiguous=True,
2025-05-07T20:32:42.1690570Z     compiled=True,
2025-05-07T20:32:42.1690653Z )
2025-05-07T20:32:42.1690868Z self = 
2025-05-07T20:32:42.1691043Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:42.1691047Z 
2025-05-07T20:32:42.1691124Z     @given(
2025-05-07T20:32:42.1691241Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.1691342Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.1691453Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.1691566Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.1691724Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.1691799Z     )
2025-05-07T20:32:42.1692048Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.1692151Z     def test_silu_mul_quant(
2025-05-07T20:32:42.1692228Z         self,
2025-05-07T20:32:42.1692308Z         T: int,
2025-05-07T20:32:42.1692383Z         D: int,
2025-05-07T20:32:42.1692477Z         scale_ub: Optional[float],
2025-05-07T20:32:42.1692571Z         contiguous: bool,
2025-05-07T20:32:42.1692654Z         compiled: bool,
2025-05-07T20:32:42.1692729Z     ) -> None:
2025-05-07T20:32:42.1692830Z         torch.manual_seed(2025)
2025-05-07T20:32:42.1692901Z 
2025-05-07T20:32:42.1693073Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.1693154Z 
2025-05-07T20:32:42.1693247Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.1693369Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.1693463Z         x = x_sign * x_clamp
2025-05-07T20:32:42.1693538Z         x0 = x[:, :D]
2025-05-07T20:32:42.1693667Z         x1 = x[:, D:]
2025-05-07T20:32:42.1693740Z 
2025-05-07T20:32:42.1693818Z         if contiguous:
2025-05-07T20:32:42.1693916Z             x0 = x0.contiguous()
2025-05-07T20:32:42.1694004Z             x1 = x1.contiguous()
2025-05-07T20:32:42.1694078Z 
2025-05-07T20:32:42.1694173Z         if scale_ub is not None:
2025-05-07T20:32:42.1694276Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.1694408Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.1694486Z             )
2025-05-07T20:32:42.1694560Z         else:
2025-05-07T20:32:42.1694653Z             scale_ub_tensor = None
2025-05-07T20:32:42.1694730Z 
2025-05-07T20:32:42.1694864Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.1694951Z             op = silu_mul_quant
2025-05-07T20:32:42.1695044Z             if compiled:
2025-05-07T20:32:42.1695147Z                 op = torch.compile(op)
2025-05-07T20:32:42.1695263Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1695342Z 
2025-05-07T20:32:42.1695435Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.1695439Z 
2025-05-07T20:32:42.1695543Z moe/activation_test.py:117: 
2025-05-07T20:32:42.1695673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1695775Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.1695879Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1696252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:42.1696356Z     return fn(*args, **kwargs)
2025-05-07T20:32:42.1696860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.1696954Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.1697324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.1697594Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.1697973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.1698079Z     kernel = self.compile(
2025-05-07T20:32:42.1698463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.1698646Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.1698772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1698777Z 
2025-05-07T20:32:42.1698981Z self = 
2025-05-07T20:32:42.1699799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.1700305Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb28785ee80>}
2025-05-07T20:32:42.1701064Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.1701254Z context = 
2025-05-07T20:32:42.1701258Z 
2025-05-07T20:32:42.1701429Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.1701691Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.1701799Z                            module_map=module_map)
2025-05-07T20:32:42.1701971Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.1702070Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.1702192Z E       ^
2025-05-07T20:32:42.1702555Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1702559Z 
2025-05-07T20:32:42.1702974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
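Every failure in this excerpt has the same root cause: Triton refuses to lower the fp8e4nv element type (PyTorch's torch.float8_e4m3fn) on this runner's GPU. fp8e4nv lowering requires an NVIDIA device with compute capability 8.9 or newer (Ada or Hopper); the linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are available and every kernel that touches fp8e4nv dies in ast_to_ttir. A minimal sketch of a capability gate that would skip these cases on unsupported hardware (the helper and class names are illustrative, not from activation_test.py):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ GPUs
        # (e.g. L4, L40S, H100). The A10G on a g5.4xlarge is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...

Gating at class level would skip the whole property before Hypothesis draws any examples, instead of failing one example at a time as happens in this log.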
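The repetition that follows is expected Hypothesis behavior rather than noise: @given with st.sampled_from draws from a fixed 5 x 2 x 2 x 2 x 2 grid (80 parameter combinations), @settings(max_examples=_MAX_SAMPLES) bounds how many of them are tried, and verbosity=Verbosity.verbose echoes each attempt as a "Trying example" block with the full test listing. A self-contained sketch of the same pattern (the test body here is illustrative):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # At verbose verbosity each drawn (T, D) pair is printed as
        # "Trying example: ...", exactly like the blocks in this log.
        assert T * D > 0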
[... repeated identical failure omitted: Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same listing, same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:42.1703087Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1703310Z     self=,
2025-05-07T20:32:42.1703387Z     T=1,
2025-05-07T20:32:42.1703471Z     D=7168,
2025-05-07T20:32:42.1703553Z     scale_ub=None,
2025-05-07T20:32:42.1703639Z     contiguous=False,
2025-05-07T20:32:42.1703729Z     compiled=True,
2025-05-07T20:32:42.1703802Z )
2025-05-07T20:32:42.1704026Z self = 
2025-05-07T20:32:42.1704203Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test body identical to the listing above, through the definition of fn() ...]
2025-05-07T20:32:42.1727304Z         y_fp8, y_scale = fn()
2025-05-07T20:32:42.1727434Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:42.1727506Z 
2025-05-07T20:32:42.1727642Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.1727751Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:42.1727851Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:42.1727971Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:42.1728116Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.1728191Z 
2025-05-07T20:32:42.1728297Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:42.1728301Z 
2025-05-07T20:32:42.1728398Z moe/activation_test.py:126: 
2025-05-07T20:32:42.1728569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1728684Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:42.1728819Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.1729379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:42.1729485Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:42.1729845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.1730082Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.1730450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:42.1730708Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.1731118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:42.1731415Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.1731800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:42.1731964Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:42.1732307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:42.1732389Z     fn()
2025-05-07T20:32:42.1732791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:42.1732874Z     self.fn.run(
2025-05-07T20:32:42.1733223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.1733317Z     kernel = self.compile(
2025-05-07T20:32:42.1733709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.1733891Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.1734019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1734024Z 
2025-05-07T20:32:42.1734236Z self = 
2025-05-07T20:32:42.1735008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.1735514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287af5580>}
2025-05-07T20:32:42.1736309Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.1736546Z context = 
2025-05-07T20:32:42.1736560Z 
2025-05-07T20:32:42.1736726Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.1736993Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.1737108Z                            module_map=module_map)
2025-05-07T20:32:42.1737267Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.1737367Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:42.1737452Z E       ^
2025-05-07T20:32:42.1737806Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.1737811Z 
2025-05-07T20:32:42.1738277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
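The example just above is the one informative variant in the batch: fn() itself returned, and the failure moved into the reference path, where triton_quantize_fp8_row launches its own Triton kernel (_kernel_quantize_fp8_row) from the autotuner and trips over the same fp8e4nv restriction. The row-wise quantization scheme itself needs no Triton; a plain-PyTorch sketch of the idea follows (an illustration of per-row FP8 scaling, not FBGEMM's implementation; the exact scale_ub semantics are assumed):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Map each row's max |value| onto the fp8 e4m3 max (448.0),
        # optionally capping the row max at scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / fp8_max  # dequantization scale
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Because this sketch never touches Triton, it matches the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) and is one way to keep a numerical reference alive on hardware where the fused kernels cannot compile.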
[... eight further examples omitted; each reproduces the same listing and fails with the identical CompilationError ("type fp8e4nv not supported in this architecture") while compiling _fbgemm_silu_mul_quant:
    T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
    T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True ...]
2025-05-07T20:32:42.1849497Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.1849722Z     self=,
2025-05-07T20:32:42.1849881Z     T=4096,
2025-05-07T20:32:42.1849964Z     D=5120,
2025-05-07T20:32:42.1850048Z     scale_ub=1200.0,
2025-05-07T20:32:42.1850147Z     contiguous=False,
2025-05-07T20:32:42.1850235Z     compiled=False,
2025-05-07T20:32:42.1850310Z )
2025-05-07T20:32:42.1850537Z self = 
2025-05-07T20:32:42.1850711Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
[... test body identical to the listing above ...]
2025-05-07T20:32:42.1855207Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.1855371Z moe/activation_test.py:117: 
2025-05-07T20:32:42.1855554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.1855655Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.1855752Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.1856260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.1856356Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1856729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1856952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1857294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1857438Z kernel = self.compile( 2025-05-07T20:32:42.1857832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1858013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1858149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1858154Z 2025-05-07T20:32:42.1858358Z self = 2025-05-07T20:32:42.1859134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1859632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8bab420>} 2025-05-07T20:32:42.1860401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1860642Z context = 2025-05-07T20:32:42.1860647Z 2025-05-07T20:32:42.1860811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1861081Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1861185Z module_map=module_map) 2025-05-07T20:32:42.1861352Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1861452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1861526Z E ^ 2025-05-07T20:32:42.1861891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1861896Z 2025-05-07T20:32:42.1862314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1862323Z 2025-05-07T20:32:42.1862434Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1862658Z self=, 2025-05-07T20:32:42.1862736Z T=4096, 2025-05-07T20:32:42.1862831Z D=5120, 2025-05-07T20:32:42.1862916Z scale_ub=1200.0, 2025-05-07T20:32:42.1863001Z contiguous=False, 2025-05-07T20:32:42.1863092Z compiled=True, 2025-05-07T20:32:42.1863167Z ) 2025-05-07T20:32:42.1863383Z self = 2025-05-07T20:32:42.1863573Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1863577Z 2025-05-07T20:32:42.1863658Z @given( 2025-05-07T20:32:42.1863781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1863890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1864051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1864231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1864347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1864426Z ) 2025-05-07T20:32:42.1864688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1864787Z def test_silu_mul_quant( 2025-05-07T20:32:42.1864866Z self, 2025-05-07T20:32:42.1864958Z T: int, 2025-05-07T20:32:42.1865038Z D: int, 2025-05-07T20:32:42.1865138Z scale_ub: Optional[float], 2025-05-07T20:32:42.1865240Z contiguous: bool, 2025-05-07T20:32:42.1865328Z compiled: bool, 2025-05-07T20:32:42.1865414Z ) -> None: 2025-05-07T20:32:42.1865508Z torch.manual_seed(2025) 2025-05-07T20:32:42.1865582Z 2025-05-07T20:32:42.1865759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1865877Z 2025-05-07T20:32:42.1865974Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1866108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1866200Z x = x_sign * x_clamp 2025-05-07T20:32:42.1866280Z x0 = x[:, :D] 2025-05-07T20:32:42.1866370Z x1 = x[:, D:] 2025-05-07T20:32:42.1866442Z 2025-05-07T20:32:42.1866524Z if contiguous: 2025-05-07T20:32:42.1866623Z x0 = x0.contiguous() 2025-05-07T20:32:42.1866712Z x1 = x1.contiguous() 2025-05-07T20:32:42.1866787Z 2025-05-07T20:32:42.1866885Z if scale_ub is not None: 2025-05-07T20:32:42.1866990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1867131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1867209Z ) 2025-05-07T20:32:42.1867681Z else: 2025-05-07T20:32:42.1867783Z scale_ub_tensor = None 2025-05-07T20:32:42.1867859Z 2025-05-07T20:32:42.1867994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1868094Z op = silu_mul_quant 2025-05-07T20:32:42.1868229Z if compiled: 2025-05-07T20:32:42.1868328Z op = torch.compile(op) 2025-05-07T20:32:42.1868449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1868522Z 2025-05-07T20:32:42.1868616Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1868621Z 2025-05-07T20:32:42.1868724Z moe/activation_test.py:117: 2025-05-07T20:32:42.1868851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1868951Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1869052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1869423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1869518Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1870025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1870125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1870496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1870716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1871057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1871156Z kernel = self.compile( 2025-05-07T20:32:42.1871540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1871724Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1871850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1871854Z 2025-05-07T20:32:42.1872060Z self = 2025-05-07T20:32:42.1872911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1873453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fc860>} 2025-05-07T20:32:42.1874209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1874396Z context = 2025-05-07T20:32:42.1874401Z 2025-05-07T20:32:42.1874562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1874880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1874988Z module_map=module_map) 2025-05-07T20:32:42.1875159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1875258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1875331Z E ^ 2025-05-07T20:32:42.1875693Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1875698Z 2025-05-07T20:32:42.1876110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1876114Z 2025-05-07T20:32:42.1876228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1876449Z self=, 2025-05-07T20:32:42.1876525Z T=2048, 2025-05-07T20:32:42.1876612Z D=7168, 2025-05-07T20:32:42.1876697Z scale_ub=1200.0, 2025-05-07T20:32:42.1876782Z contiguous=False, 2025-05-07T20:32:42.1876877Z compiled=False, 2025-05-07T20:32:42.1877002Z ) 2025-05-07T20:32:42.1877217Z self = 2025-05-07T20:32:42.1877403Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1877407Z 2025-05-07T20:32:42.1877485Z @given( 2025-05-07T20:32:42.1877620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1877715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1877829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1877954Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1878063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1878136Z ) 2025-05-07T20:32:42.1878390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1878484Z def test_silu_mul_quant( 2025-05-07T20:32:42.1878567Z self, 2025-05-07T20:32:42.1878656Z T: int, 2025-05-07T20:32:42.1878735Z D: int, 2025-05-07T20:32:42.1878834Z scale_ub: Optional[float], 2025-05-07T20:32:42.1878932Z contiguous: bool, 2025-05-07T20:32:42.1879018Z compiled: bool, 2025-05-07T20:32:42.1879102Z ) -> None: 2025-05-07T20:32:42.1879201Z torch.manual_seed(2025) 2025-05-07T20:32:42.1879274Z 2025-05-07T20:32:42.1879452Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1879529Z 2025-05-07T20:32:42.1879626Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1879772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1879865Z x = x_sign * x_clamp 2025-05-07T20:32:42.1879945Z x0 = x[:, :D] 2025-05-07T20:32:42.1880037Z x1 = x[:, D:] 2025-05-07T20:32:42.1880109Z 2025-05-07T20:32:42.1880196Z if contiguous: 2025-05-07T20:32:42.1880305Z x0 = x0.contiguous() 2025-05-07T20:32:42.1880395Z x1 = x1.contiguous() 2025-05-07T20:32:42.1880539Z 2025-05-07T20:32:42.1880638Z if scale_ub is not None: 2025-05-07T20:32:42.1880789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1880939Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1881019Z ) 2025-05-07T20:32:42.1881095Z else: 2025-05-07T20:32:42.1881201Z scale_ub_tensor = None 2025-05-07T20:32:42.1881277Z 2025-05-07T20:32:42.1881410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1881513Z op = silu_mul_quant 2025-05-07T20:32:42.1881598Z if compiled: 2025-05-07T20:32:42.1881698Z op = torch.compile(op) 2025-05-07T20:32:42.1881815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1881889Z 2025-05-07T20:32:42.1881993Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1881998Z 2025-05-07T20:32:42.1882133Z moe/activation_test.py:117: 2025-05-07T20:32:42.1882268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1882380Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1882481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1882985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1883090Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1883449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1883828Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1884170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1884264Z kernel = self.compile( 2025-05-07T20:32:42.1884662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1884839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1885011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1885023Z 2025-05-07T20:32:42.1885227Z self = 2025-05-07T20:32:42.1885990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1886495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fd6c0>} 2025-05-07T20:32:42.1887248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1887447Z context = 2025-05-07T20:32:42.1887454Z 2025-05-07T20:32:42.1887616Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1887876Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1887991Z module_map=module_map) 2025-05-07T20:32:42.1888151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1888244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1888330Z E ^ 2025-05-07T20:32:42.1888682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1888687Z 2025-05-07T20:32:42.1889107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1889111Z 2025-05-07T20:32:42.1889262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1889523Z self=, 2025-05-07T20:32:42.1889605Z T=1, 2025-05-07T20:32:42.1889679Z D=7168, 2025-05-07T20:32:42.1889765Z scale_ub=None, 2025-05-07T20:32:42.1889847Z contiguous=True, 2025-05-07T20:32:42.1889927Z compiled=False, 2025-05-07T20:32:42.1890005Z ) 2025-05-07T20:32:42.1890222Z self = 2025-05-07T20:32:42.1890384Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1890389Z 2025-05-07T20:32:42.1890474Z @given( 2025-05-07T20:32:42.1890591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1890687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1890807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1890965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1891088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1891170Z ) 2025-05-07T20:32:42.1891412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1891510Z def test_silu_mul_quant( 2025-05-07T20:32:42.1891585Z self, 2025-05-07T20:32:42.1891659Z T: int, 2025-05-07T20:32:42.1891740Z D: int, 2025-05-07T20:32:42.1891836Z scale_ub: Optional[float], 2025-05-07T20:32:42.1891923Z contiguous: bool, 2025-05-07T20:32:42.1892014Z compiled: bool, 2025-05-07T20:32:42.1892090Z ) -> None: 2025-05-07T20:32:42.1892184Z torch.manual_seed(2025) 2025-05-07T20:32:42.1892261Z 2025-05-07T20:32:42.1892429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1892497Z 2025-05-07T20:32:42.1892593Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1892717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1892815Z x = x_sign * x_clamp 2025-05-07T20:32:42.1892938Z x0 = x[:, :D] 2025-05-07T20:32:42.1893015Z x1 = x[:, D:] 2025-05-07T20:32:42.1893091Z 2025-05-07T20:32:42.1893171Z if contiguous: 2025-05-07T20:32:42.1893265Z x0 = x0.contiguous() 2025-05-07T20:32:42.1893361Z x1 = x1.contiguous() 2025-05-07T20:32:42.1893434Z 2025-05-07T20:32:42.1893524Z if scale_ub is not None: 2025-05-07T20:32:42.1893636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1893770Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1893844Z ) 2025-05-07T20:32:42.1893925Z else: 2025-05-07T20:32:42.1894015Z scale_ub_tensor = None 2025-05-07T20:32:42.1894093Z 2025-05-07T20:32:42.1894221Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1894310Z op = silu_mul_quant 2025-05-07T20:32:42.1894400Z if compiled: 2025-05-07T20:32:42.1894501Z op = torch.compile(op) 2025-05-07T20:32:42.1894608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1894688Z 2025-05-07T20:32:42.1894778Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1894783Z 2025-05-07T20:32:42.1894878Z moe/activation_test.py:117: 2025-05-07T20:32:42.1895012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1895112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1895218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1895721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1895816Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1896182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1896406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1896796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1896943Z kernel = self.compile( 2025-05-07T20:32:42.1897326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1897508Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1897636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1897641Z 2025-05-07T20:32:42.1897845Z self = 2025-05-07T20:32:42.1898622Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1899187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875fcfe0>} 2025-05-07T20:32:42.1899948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1900135Z context = 2025-05-07T20:32:42.1900140Z 2025-05-07T20:32:42.1900310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1900573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1900679Z module_map=module_map) 2025-05-07T20:32:42.1900845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1900940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1901021Z E ^ 2025-05-07T20:32:42.1901383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1901459Z 2025-05-07T20:32:42.1901875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1901879Z 2025-05-07T20:32:42.1901987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1902207Z self=, 2025-05-07T20:32:42.1902287Z T=16384, 2025-05-07T20:32:42.1902368Z D=7168, 2025-05-07T20:32:42.1902450Z scale_ub=1200.0, 2025-05-07T20:32:42.1902540Z contiguous=False, 2025-05-07T20:32:42.1902627Z compiled=True, 2025-05-07T20:32:42.1902698Z ) 2025-05-07T20:32:42.1902915Z self = 2025-05-07T20:32:42.1903101Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1903105Z 2025-05-07T20:32:42.1903185Z @given( 2025-05-07T20:32:42.1903312Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1903411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1903524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1903645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1903757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1903829Z ) 2025-05-07T20:32:42.1904079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1904170Z def test_silu_mul_quant( 2025-05-07T20:32:42.1904245Z self, 2025-05-07T20:32:42.1904330Z T: int, 2025-05-07T20:32:42.1904405Z D: int, 2025-05-07T20:32:42.1904504Z scale_ub: Optional[float], 2025-05-07T20:32:42.1904592Z contiguous: bool, 2025-05-07T20:32:42.1904675Z compiled: bool, 2025-05-07T20:32:42.1904761Z ) -> None: 2025-05-07T20:32:42.1904852Z torch.manual_seed(2025) 2025-05-07T20:32:42.1904976Z 2025-05-07T20:32:42.1905216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1905292Z 2025-05-07T20:32:42.1905385Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1905527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1905616Z x = x_sign * x_clamp 2025-05-07T20:32:42.1905696Z x0 = x[:, :D] 2025-05-07T20:32:42.1905783Z x1 = x[:, D:] 2025-05-07T20:32:42.1905853Z 2025-05-07T20:32:42.1905933Z if contiguous: 2025-05-07T20:32:42.1906035Z x0 = x0.contiguous() 2025-05-07T20:32:42.1906121Z x1 = x1.contiguous() 2025-05-07T20:32:42.1906192Z 2025-05-07T20:32:42.1906289Z if scale_ub is not None: 2025-05-07T20:32:42.1906391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1906528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1906657Z ) 2025-05-07T20:32:42.1906733Z else: 2025-05-07T20:32:42.1906841Z scale_ub_tensor = None 2025-05-07T20:32:42.1906915Z 2025-05-07T20:32:42.1907044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1907136Z op = silu_mul_quant 2025-05-07T20:32:42.1907220Z if compiled: 2025-05-07T20:32:42.1907318Z op = torch.compile(op) 2025-05-07T20:32:42.1907431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1907501Z 2025-05-07T20:32:42.1907588Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1907592Z 2025-05-07T20:32:42.1907692Z moe/activation_test.py:117: 2025-05-07T20:32:42.1907821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1907925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1908025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1908400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1908502Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1909051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1909147Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1909517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1909739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1910088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1910181Z kernel = self.compile( 2025-05-07T20:32:42.1910565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1910749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1910878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1910888Z 2025-05-07T20:32:42.1911099Z self = 2025-05-07T20:32:42.1911865Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1912360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb2875ffb00>} 2025-05-07T20:32:42.1913119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1913310Z context = 2025-05-07T20:32:42.1913355Z 2025-05-07T20:32:42.1913527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1913840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1913943Z module_map=module_map) 2025-05-07T20:32:42.1914111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1914206Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1914283Z E ^ 2025-05-07T20:32:42.1914644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1914649Z 2025-05-07T20:32:42.1915060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1915065Z 2025-05-07T20:32:42.1915169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1915431Z self=, 2025-05-07T20:32:42.1915511Z T=1, 2025-05-07T20:32:42.1915596Z D=7168, 2025-05-07T20:32:42.1915677Z scale_ub=None, 2025-05-07T20:32:42.1915758Z contiguous=False, 2025-05-07T20:32:42.1915846Z compiled=False, 2025-05-07T20:32:42.1915918Z ) 2025-05-07T20:32:42.1916139Z self = 2025-05-07T20:32:42.1916304Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.1916308Z 2025-05-07T20:32:42.1916386Z @given( 2025-05-07T20:32:42.1916511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1916608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1916723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1916844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1916955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1917036Z ) 2025-05-07T20:32:42.1917280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1917420Z def test_silu_mul_quant( 2025-05-07T20:32:42.1917502Z self, 2025-05-07T20:32:42.1917579Z T: int, 2025-05-07T20:32:42.1917652Z D: int, 2025-05-07T20:32:42.1917754Z scale_ub: Optional[float], 2025-05-07T20:32:42.1917842Z contiguous: bool, 2025-05-07T20:32:42.1917924Z compiled: bool, 2025-05-07T20:32:42.1918008Z ) -> None: 2025-05-07T20:32:42.1918103Z torch.manual_seed(2025) 2025-05-07T20:32:42.1918175Z 2025-05-07T20:32:42.1918352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1918424Z 2025-05-07T20:32:42.1918512Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1918642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1918730Z x = x_sign * x_clamp 2025-05-07T20:32:42.1918820Z x0 = x[:, :D] 2025-05-07T20:32:42.1918897Z x1 = x[:, D:] 2025-05-07T20:32:42.1918970Z 2025-05-07T20:32:42.1919061Z if contiguous: 2025-05-07T20:32:42.1919151Z x0 = x0.contiguous() 2025-05-07T20:32:42.1919237Z x1 = x1.contiguous() 2025-05-07T20:32:42.1919319Z 2025-05-07T20:32:42.1919408Z if scale_ub is not None: 2025-05-07T20:32:42.1919510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1919650Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1919725Z ) 2025-05-07T20:32:42.1919800Z else: 2025-05-07T20:32:42.1919901Z scale_ub_tensor = None 2025-05-07T20:32:42.1919969Z 2025-05-07T20:32:42.1920107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1920196Z op = silu_mul_quant 2025-05-07T20:32:42.1920276Z if compiled: 2025-05-07T20:32:42.1920381Z op = torch.compile(op) 2025-05-07T20:32:42.1920488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1920557Z 2025-05-07T20:32:42.1920701Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1920745Z 2025-05-07T20:32:42.1920842Z moe/activation_test.py:117: 2025-05-07T20:32:42.1920968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1921078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1921178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1921687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1921785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1922147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1922376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1922759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1922860Z kernel = self.compile( 2025-05-07T20:32:42.1923260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1923434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1923694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1923700Z 2025-05-07T20:32:42.1923905Z self = 2025-05-07T20:32:42.1924671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1925180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a83749a0>} 2025-05-07T20:32:42.1925934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1926177Z context = 2025-05-07T20:32:42.1926181Z 2025-05-07T20:32:42.1926345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1926612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1926717Z module_map=module_map) 2025-05-07T20:32:42.1926881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1926982Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1927056Z E ^ 2025-05-07T20:32:42.1927412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1927417Z 2025-05-07T20:32:42.1927839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1927849Z 2025-05-07T20:32:42.1927951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1928177Z self=, 2025-05-07T20:32:42.1928254Z T=2048, 2025-05-07T20:32:42.1928329Z D=7168, 2025-05-07T20:32:42.1928413Z scale_ub=None, 2025-05-07T20:32:42.1928498Z contiguous=False, 2025-05-07T20:32:42.1928576Z compiled=True, 2025-05-07T20:32:42.1928653Z ) 2025-05-07T20:32:42.1928867Z self = 2025-05-07T20:32:42.1929037Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1929048Z 2025-05-07T20:32:42.1929122Z @given( 2025-05-07T20:32:42.1929241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1929389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1929509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1929671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1929790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1929863Z ) 2025-05-07T20:32:42.1930105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1930201Z def test_silu_mul_quant( 2025-05-07T20:32:42.1930275Z self, 2025-05-07T20:32:42.1930349Z T: int, 2025-05-07T20:32:42.1930428Z D: int, 2025-05-07T20:32:42.1930523Z scale_ub: Optional[float], 2025-05-07T20:32:42.1930615Z contiguous: bool, 2025-05-07T20:32:42.1930698Z compiled: bool, 2025-05-07T20:32:42.1930770Z ) -> None: 2025-05-07T20:32:42.1930871Z torch.manual_seed(2025) 2025-05-07T20:32:42.1930943Z 2025-05-07T20:32:42.1931180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1931266Z 2025-05-07T20:32:42.1931355Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1931481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1931575Z x = x_sign * x_clamp 2025-05-07T20:32:42.1931656Z x0 = x[:, :D] 2025-05-07T20:32:42.1931733Z x1 = x[:, D:] 2025-05-07T20:32:42.1931807Z 2025-05-07T20:32:42.1931890Z if contiguous: 2025-05-07T20:32:42.1931988Z x0 = x0.contiguous() 2025-05-07T20:32:42.1932073Z x1 = x1.contiguous() 2025-05-07T20:32:42.1932142Z 2025-05-07T20:32:42.1932234Z if scale_ub is not None: 2025-05-07T20:32:42.1932337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1932469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1932548Z ) 2025-05-07T20:32:42.1932623Z else: 2025-05-07T20:32:42.1932712Z scale_ub_tensor = None 2025-05-07T20:32:42.1932793Z 2025-05-07T20:32:42.1932922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1933057Z op = silu_mul_quant 2025-05-07T20:32:42.1933148Z if compiled: 2025-05-07T20:32:42.1933245Z op = torch.compile(op) 2025-05-07T20:32:42.1933346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1933425Z 2025-05-07T20:32:42.1933512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1933518Z 2025-05-07T20:32:42.1933616Z moe/activation_test.py:117: 2025-05-07T20:32:42.1933743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1933841Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1933946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1934314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1934404Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1934906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1935009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1935379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1935606Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1935949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1936050Z kernel = self.compile( 2025-05-07T20:32:42.1936434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1936613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1936739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1936746Z 2025-05-07T20:32:42.1936994Z self = 2025-05-07T20:32:42.1937814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1938312Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8375d00>} 2025-05-07T20:32:42.1939441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1939633Z context = 2025-05-07T20:32:42.1939638Z 2025-05-07T20:32:42.1939944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1940219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1940331Z module_map=module_map) 2025-05-07T20:32:42.1940498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1940595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1940673Z E ^ 2025-05-07T20:32:42.1941032Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1941037Z 2025-05-07T20:32:42.1941451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1941455Z 2025-05-07T20:32:42.1941562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1941784Z self=, 2025-05-07T20:32:42.1941856Z T=4096, 2025-05-07T20:32:42.1941941Z D=7168, 2025-05-07T20:32:42.1942025Z scale_ub=None, 2025-05-07T20:32:42.1942186Z contiguous=False, 2025-05-07T20:32:42.1942279Z compiled=True, 2025-05-07T20:32:42.1942351Z ) 2025-05-07T20:32:42.1942568Z self = 2025-05-07T20:32:42.1942748Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1942753Z 2025-05-07T20:32:42.1942832Z @given( 2025-05-07T20:32:42.1942960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1943055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1943167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1943288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1943399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1943471Z ) 2025-05-07T20:32:42.1943727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1943819Z def test_silu_mul_quant( 2025-05-07T20:32:42.1943900Z self, 2025-05-07T20:32:42.1943984Z T: int, 2025-05-07T20:32:42.1944059Z D: int, 2025-05-07T20:32:42.1944154Z scale_ub: Optional[float], 2025-05-07T20:32:42.1944245Z contiguous: bool, 2025-05-07T20:32:42.1944328Z compiled: bool, 2025-05-07T20:32:42.1944410Z ) -> None: 2025-05-07T20:32:42.1944503Z torch.manual_seed(2025) 2025-05-07T20:32:42.1944577Z 2025-05-07T20:32:42.1944753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1944827Z 2025-05-07T20:32:42.1944914Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1945043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1945132Z x = x_sign * x_clamp 2025-05-07T20:32:42.1945211Z x0 = x[:, :D] 2025-05-07T20:32:42.1945295Z x1 = x[:, D:] 2025-05-07T20:32:42.1945369Z 2025-05-07T20:32:42.1945455Z if contiguous: 2025-05-07T20:32:42.1945552Z x0 = x0.contiguous() 2025-05-07T20:32:42.1945722Z x1 = x1.contiguous() 2025-05-07T20:32:42.1945858Z 2025-05-07T20:32:42.1945953Z if scale_ub is not None: 2025-05-07T20:32:42.1946054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1946195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1946271Z ) 2025-05-07T20:32:42.1946345Z else: 2025-05-07T20:32:42.1946442Z scale_ub_tensor = None 2025-05-07T20:32:42.1946512Z 2025-05-07T20:32:42.1946639Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1946733Z op = silu_mul_quant 2025-05-07T20:32:42.1946815Z if compiled: 2025-05-07T20:32:42.1946910Z op = torch.compile(op) 2025-05-07T20:32:42.1947020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1947092Z 2025-05-07T20:32:42.1947181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1947237Z 2025-05-07T20:32:42.1947337Z moe/activation_test.py:117: 2025-05-07T20:32:42.1947466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1947574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1947671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1948042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1948138Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1948633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1948735Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1949096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1949320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1949670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1949811Z kernel = self.compile( 2025-05-07T20:32:42.1950194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1950375Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1950501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1950505Z 2025-05-07T20:32:42.1950721Z self = 2025-05-07T20:32:42.1951490Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1951992Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb3a8376840>} 2025-05-07T20:32:42.1952757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1952945Z context = 2025-05-07T20:32:42.1952949Z 2025-05-07T20:32:42.1953121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1953385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1953488Z module_map=module_map) 2025-05-07T20:32:42.1953656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1953753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1953835Z E ^ 2025-05-07T20:32:42.1954236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1954282Z 2025-05-07T20:32:42.1954697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1954701Z 2025-05-07T20:32:42.1954812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1955031Z self=, 2025-05-07T20:32:42.1955116Z T=16384, 2025-05-07T20:32:42.1955190Z D=5120, 2025-05-07T20:32:42.1955268Z scale_ub=1200.0, 2025-05-07T20:32:42.1955360Z contiguous=False, 2025-05-07T20:32:42.1955446Z compiled=False, 2025-05-07T20:32:42.1955516Z ) 2025-05-07T20:32:42.1955737Z self = 2025-05-07T20:32:42.1955916Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1955920Z 2025-05-07T20:32:42.1956037Z @given( 2025-05-07T20:32:42.1956167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1956270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1956391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1956505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1956617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1956699Z ) 2025-05-07T20:32:42.1956946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1957036Z def test_silu_mul_quant( 2025-05-07T20:32:42.1957124Z self, 2025-05-07T20:32:42.1957200Z T: int, 2025-05-07T20:32:42.1957273Z D: int, 2025-05-07T20:32:42.1957379Z scale_ub: Optional[float], 2025-05-07T20:32:42.1957467Z contiguous: bool, 2025-05-07T20:32:42.1957550Z compiled: bool, 2025-05-07T20:32:42.1957637Z ) -> None: 2025-05-07T20:32:42.1957733Z torch.manual_seed(2025) 2025-05-07T20:32:42.1957814Z 2025-05-07T20:32:42.1957986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1958106Z 2025-05-07T20:32:42.1958201Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1958326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1958412Z x = x_sign * x_clamp 2025-05-07T20:32:42.1958494Z x0 = x[:, :D] 2025-05-07T20:32:42.1958574Z x1 = x[:, D:] 2025-05-07T20:32:42.1958646Z 2025-05-07T20:32:42.1958740Z if contiguous: 2025-05-07T20:32:42.1958828Z x0 = x0.contiguous() 2025-05-07T20:32:42.1958915Z x1 = x1.contiguous() 2025-05-07T20:32:42.1958991Z 2025-05-07T20:32:42.1959078Z if scale_ub is not None: 2025-05-07T20:32:42.1959184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1959325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1959402Z ) 2025-05-07T20:32:42.1959493Z else: 2025-05-07T20:32:42.1959588Z scale_ub_tensor = None 2025-05-07T20:32:42.1959665Z 2025-05-07T20:32:42.1959803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1959894Z op = silu_mul_quant 2025-05-07T20:32:42.1959976Z if compiled: 2025-05-07T20:32:42.1960083Z op = torch.compile(op) 2025-05-07T20:32:42.1960192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1960269Z 2025-05-07T20:32:42.1960379Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1960383Z 2025-05-07T20:32:42.1960484Z moe/activation_test.py:117: 2025-05-07T20:32:42.1960624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1960725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1960829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1961351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1961497Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1961867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1962181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1962522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1962624Z kernel = self.compile( 2025-05-07T20:32:42.1963008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1963181Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1963313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1963318Z 2025-05-07T20:32:42.1963522Z self = 2025-05-07T20:32:42.1964465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1964970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f14040>} 2025-05-07T20:32:42.1965717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1965917Z context = 2025-05-07T20:32:42.1965922Z 2025-05-07T20:32:42.1966087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1966356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1966463Z module_map=module_map) 2025-05-07T20:32:42.1966669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1966769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1966844Z E ^ 2025-05-07T20:32:42.1972385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1972395Z 2025-05-07T20:32:42.1972838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1972843Z 2025-05-07T20:32:42.1972957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1973182Z self=, 2025-05-07T20:32:42.1973270Z T=16384, 2025-05-07T20:32:42.1973350Z D=5120, 2025-05-07T20:32:42.1973434Z scale_ub=1200.0, 2025-05-07T20:32:42.1973535Z contiguous=True, 2025-05-07T20:32:42.1973620Z compiled=True, 2025-05-07T20:32:42.1973694Z ) 2025-05-07T20:32:42.1973928Z self = 2025-05-07T20:32:42.1974111Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.1974116Z 2025-05-07T20:32:42.1974193Z @given( 2025-05-07T20:32:42.1974323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1974424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1974550Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1974667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1974781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1974865Z ) 2025-05-07T20:32:42.1975115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1975210Z def test_silu_mul_quant( 2025-05-07T20:32:42.1975297Z self, 2025-05-07T20:32:42.1975377Z T: int, 2025-05-07T20:32:42.1975455Z D: int, 2025-05-07T20:32:42.1975651Z scale_ub: Optional[float], 2025-05-07T20:32:42.1975783Z contiguous: bool, 2025-05-07T20:32:42.1975868Z compiled: bool, 2025-05-07T20:32:42.1975956Z ) -> None: 2025-05-07T20:32:42.1976052Z torch.manual_seed(2025) 2025-05-07T20:32:42.1976125Z 2025-05-07T20:32:42.1976305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1976381Z 2025-05-07T20:32:42.1976484Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1976612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1976706Z x = x_sign * x_clamp 2025-05-07T20:32:42.1976799Z x0 = x[:, :D] 2025-05-07T20:32:42.1976881Z x1 = x[:, D:] 2025-05-07T20:32:42.1976953Z 2025-05-07T20:32:42.1977046Z if contiguous: 2025-05-07T20:32:42.1977139Z x0 = x0.contiguous() 2025-05-07T20:32:42.1977276Z x1 = x1.contiguous() 2025-05-07T20:32:42.1977358Z 2025-05-07T20:32:42.1977452Z if scale_ub is not None: 2025-05-07T20:32:42.1977562Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1977703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1977778Z ) 2025-05-07T20:32:42.1977862Z else: 2025-05-07T20:32:42.1977957Z scale_ub_tensor = None 2025-05-07T20:32:42.1978031Z 2025-05-07T20:32:42.1978167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1978259Z op = silu_mul_quant 2025-05-07T20:32:42.1978347Z if compiled: 2025-05-07T20:32:42.1978456Z op = torch.compile(op) 2025-05-07T20:32:42.1978562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1978636Z 2025-05-07T20:32:42.1978735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1978739Z 2025-05-07T20:32:42.1978839Z moe/activation_test.py:117: 2025-05-07T20:32:42.1978979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1979137Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1979241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1979624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1979719Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1980216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1980322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1980681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1980914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1981259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1981359Z kernel = self.compile( 2025-05-07T20:32:42.1981755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1981936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1982065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1982076Z 2025-05-07T20:32:42.1982284Z self = 2025-05-07T20:32:42.1983058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1983564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb287f15300>} 2025-05-07T20:32:42.1984362Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1984605Z context = 2025-05-07T20:32:42.1984610Z 2025-05-07T20:32:42.1984774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1985038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1985153Z module_map=module_map) 2025-05-07T20:32:42.1985314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1985408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1985493Z E ^ 2025-05-07T20:32:42.1985848Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis went on to generate further examples, and every one failed identically: the same test source, the same traceback through torch/_dynamo/eval_frame.py (when compiled=True), fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, triton/runtime/jit.py, and triton/compiler/compiler.py:273, ending in the same CompilationError from compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The parameter combinations tried were:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2137706Z 2025-05-07T20:32:42.2138122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2138126Z 2025-05-07T20:32:42.2138228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2138872Z self=, 2025-05-07T20:32:42.2138952Z T=16384, 2025-05-07T20:32:42.2139023Z D=5120, 2025-05-07T20:32:42.2139108Z scale_ub=None, 2025-05-07T20:32:42.2139337Z contiguous=False, 2025-05-07T20:32:42.2139433Z compiled=False, 2025-05-07T20:32:42.2139512Z ) 2025-05-07T20:32:42.2139733Z self = 2025-05-07T20:32:42.2139915Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2139919Z 2025-05-07T20:32:42.2139995Z @given( 2025-05-07T20:32:42.2140115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2140219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2140332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2140447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2140567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2140641Z ) 2025-05-07T20:32:42.2140893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2140986Z def test_silu_mul_quant( 2025-05-07T20:32:42.2141064Z self, 2025-05-07T20:32:42.2141143Z T: int, 2025-05-07T20:32:42.2141222Z D: int, 2025-05-07T20:32:42.2141387Z scale_ub: Optional[float], 2025-05-07T20:32:42.2141483Z contiguous: bool, 2025-05-07T20:32:42.2141566Z compiled: bool, 2025-05-07T20:32:42.2141642Z ) -> None: 2025-05-07T20:32:42.2141740Z torch.manual_seed(2025) 2025-05-07T20:32:42.2141813Z 2025-05-07T20:32:42.2141980Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2142059Z 2025-05-07T20:32:42.2142147Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2142275Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2144092Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
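Note on the CompilationError: Triton's fp8e4nv is the float8_e4m3fn layout this kernel quantizes to, and Triton only accepts it on GPUs of compute capability 8.9 or newer (Ada/Hopper); on older Ampere-class parts such as sm_86 it offers only fp8e4b15 and fp8e5, which is exactly what the error reports. A minimal capability guard could skip these examples cleanly on such hardware; this is a sketch, not the FBGEMM suite's actual skip logic, and the helper name is an assumption:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9;
        # an sm_86 device reports (8, 6) and takes the error path seen above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    # Hypothetical usage on a test like test_silu_mul_quant:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")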
2025-05-07T20:32:42.2138228Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.2138872Z     self=<...>,
2025-05-07T20:32:42.2138952Z     T=16384,
2025-05-07T20:32:42.2139023Z     D=5120,
2025-05-07T20:32:42.2139108Z     scale_ub=None,
2025-05-07T20:32:42.2139337Z     contiguous=False,
2025-05-07T20:32:42.2139433Z     compiled=False,
2025-05-07T20:32:42.2139512Z )
[... test body as above ...]
2025-05-07T20:32:42.2142147Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.2142275Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.2144092Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.2144227Z moe/activation_test.py:95: OutOfMemoryError

Hypothesis then tries the remaining examples; each fails with one of the same two errors. Summary, one line per example (failing statements refer to moe/activation_test.py; the allocation arithmetic is checked after the table):

T      D     scale_ub  contiguous  compiled  fails at           error
4096   7168  1200.0    True        True      :95 (x_clamp)      OutOfMemoryError: tried 112.00 MiB, 28.44 MiB free, 21.61 GiB PyTorch-allocated
16384  7168  None      False       False     :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 140.44 MiB free, 21.50 GiB PyTorch-allocated
2048   7168  1200.0    True        True      :95 (x_clamp)      OutOfMemoryError: tried 56.00 MiB, 28.44 MiB free, 21.67 GiB PyTorch-allocated
2048   7168  None      True        False     :94 (x_sign)       OutOfMemoryError: tried 56.00 MiB, 28.44 MiB free, 21.67 GiB PyTorch-allocated
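The requested sizes above are exactly the footprint of one [T, 2 * D] bfloat16 tensor (2 bytes per element), so each example dies on its first or second allocation; the real problem is the 21.5+ GiB already held from earlier examples. A quick check of that arithmetic (plain Python, nothing FBGEMM-specific):

    # bfloat16 is 2 bytes per element; x has shape [T, 2 * D].
    def bf16_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / (1024 * 1024)

    assert bf16_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert bf16_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert bf16_mib(2048, 7168) == 56.0    # 56.00 MiB (torch.sign / torch.clamp request the same size)
    assert bf16_mib(2048, 5120) == 40.0    # 40.00 MiB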
Summary, continued:

T      D     scale_ub  contiguous  compiled  fails at           error
1      7168  1200.0    True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
128    5120  None      True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
128    7168  None      True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
2048   7168  1200.0    True        False     :92 (torch.randn)  OutOfMemoryError: tried 56.00 MiB, 26.44 MiB free, 21.69 GiB PyTorch-allocated
1      5120  1200.0    True        False     :117 (fn())        CompilationError: fp8e4nv not supported in this architecture
2048   5120  None      True        False     :94 (x_sign)       OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  5120  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 320.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   5120  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 80.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
2048   5120  None      False       False     :92 (torch.randn)  OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  None      True        True      :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
2048   5120  1200.0    False       False     :92 (torch.randn)  OutOfMemoryError: tried 40.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  1200.0    True        False     :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  7168  None      False       True      :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
4096   7168  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 112.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
16384  7168  None      True        False     :92 (torch.randn)  OutOfMemoryError: tried 448.00 MiB, 26.44 MiB free, 21.73 GiB PyTorch-allocated
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2281672Z 2025-05-07T20:32:42.2281794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2281798Z 2025-05-07T20:32:42.2281905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2282135Z self=, 2025-05-07T20:32:42.2282213Z T=16384, 2025-05-07T20:32:42.2282292Z D=7168, 2025-05-07T20:32:42.2282383Z scale_ub=1200.0, 2025-05-07T20:32:42.2282466Z contiguous=True, 2025-05-07T20:32:42.2282552Z compiled=False, 2025-05-07T20:32:42.2282634Z ) 2025-05-07T20:32:42.2282848Z self = 2025-05-07T20:32:42.2283023Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2283029Z 2025-05-07T20:32:42.2283112Z @given( 2025-05-07T20:32:42.2283274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2283409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2283752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2283870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2283990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2284066Z ) 2025-05-07T20:32:42.2284308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2284408Z def test_silu_mul_quant( 2025-05-07T20:32:42.2284484Z self, 2025-05-07T20:32:42.2284564Z T: int, 2025-05-07T20:32:42.2284648Z D: int, 2025-05-07T20:32:42.2284743Z scale_ub: Optional[float], 2025-05-07T20:32:42.2284831Z contiguous: bool, 2025-05-07T20:32:42.2284923Z compiled: bool, 2025-05-07T20:32:42.2284998Z ) -> None: 2025-05-07T20:32:42.2285134Z torch.manual_seed(2025) 2025-05-07T20:32:42.2285213Z 2025-05-07T20:32:42.2285391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2287164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
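The requested sizes match the first allocation in the test exactly: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick arithmetic check (the helper name is ours, not the suite's):

    def randn_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements,
        # 2 bytes each; report the size in MiB.
        return T * 2 * D * 2 / 2**20

    print(randn_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(randn_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
    print(randn_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"

With only 26.44 MiB free out of 22.07 GiB, even the smallest of these requests must fail; and since 21.73 GiB is already held by PyTorch, the failures point at memory accumulated across earlier Hypothesis examples rather than any single oversized tensor.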
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fb286dd2700>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
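This CompilationError is the second failure mode in the run. Triton's fp8e4nv corresponds to float8_e4m3fn, which Triton only lowers on devices of compute capability 8.9 or newer (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6), which is why only fp8e4b15 and fp8e5 appear in the ValueError. A capability guard such a test could use, sketched here (the helper name and skip message are ours, not the suite's):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) lowering requires compute capability >= 8.9;
        # the A10G on this runner reports (8, 6), hence the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Inside a test body:
    # if not supports_fp8e4nv():
    #     pytest.skip("fp8e4nv is not supported on this GPU architecture")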
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
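Every OOM message above ends with the same allocator hint. Two mitigations are worth distinguishing, sketched here under the assumption that the real fix would live in the suite's setup rather than in this log: the environment variable only helps with fragmentation, while releasing cached blocks between Hypothesis examples addresses the accumulation these failures suggest.

    import gc
    import os

    # Must be set before the process makes its first CUDA allocation, e.g. in
    # the CI job environment rather than inside a running test.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached allocator blocks
        # so the next Hypothesis example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()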
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
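Note that the compiled variant fails the same way as the eager one: torch.compile only adds the _dynamo/eval_frame.py frame, while the Triton kernel underneath still targets the same GPU. A minimal repro sketch assembled from the names in this traceback (shapes taken from the failing example; illustrative only, not the suite's code):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)

    op = torch.compile(silu_mul_quant)
    # On an sm_86 device this raises the same CompilationError: fp8e4nv ...
    y_fp8, y_scale = op(x0, x1, None)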
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
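A side note on the line-95 failures: torch.clamp(torch.abs(x), 0.01, 2.0) materializes a full extra [T, 2 * D] temporary on a GPU that is already at capacity. Where the extra copy matters and the input need not be preserved, in-place variants avoid most of it; a sketch with equivalent math (our helper, not the suite's):

    import torch

    def sign_clamp_(x: torch.Tensor) -> torch.Tensor:
        # Same result as sign(x) * clamp(abs(x), 0.01, 2.0), but abs/clamp/mul
        # reuse x's storage; only x_sign remains as a full-size temporary.
        x_sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)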
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
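The three deprecation warnings point at Triton PR #4496: the warmup, rep, and use_cuda_graph arguments to triton.autotune are deprecated. A hypothetical decorator that would emit exactly this warning (the kernel and config values are invented for illustration):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK_SIZE": 128}, num_warps=4)],
        key=["N"],
        warmup=25,             # deprecated -> DeprecationWarning at autotuner.py:108
        rep=100,               # deprecated
        use_cuda_graph=False,  # deprecated
    )
    @triton.jit
    def _dummy_kernel(x_ptr, N, BLOCK_SIZE: tl.constexpr):
        pass

Dropping the three arguments and letting the autotuner use its own benchmarking defaults is the straightforward way to silence the warning.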
See " 2025-05-07T20:32:42.2337214Z 2025-05-07T20:32:42.2337433Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.2337599Z ================= 1 failed, 1 deselected, 3 warnings in 14.28s ================= 2025-05-07T20:32:43.9091040Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:43.9738341Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:43.9738956Z 2025-05-07T20:32:43.9739759Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:43.9740378Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:43.9740791Z 2025-05-07T20:32:43.9740797Z 2025-05-07T20:32:43.9740829Z 2025-05-07T20:32:43.9757897Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:43.9846514Z Post job cleanup. 2025-05-07T20:32:44.0832499Z [command]/usr/bin/git version 2025-05-07T20:32:44.0877581Z git version 2.47.1 2025-05-07T20:32:44.0912508Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/c05eb607-b4a7-4c62-baaf-1e5e09a282de/.gitconfig' 2025-05-07T20:32:44.0922606Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c05eb607-b4a7-4c62-baaf-1e5e09a282de' before making global git config changes 2025-05-07T20:32:44.0923477Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:44.0928214Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:44.0971621Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:44.1006073Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:44.1339901Z Entering 'external/asmjit' 2025-05-07T20:32:44.1406336Z Entering 'external/composable_kernel' 2025-05-07T20:32:44.1478514Z Entering 'external/cpuinfo' 2025-05-07T20:32:44.1552785Z Entering 'external/cutlass' 2025-05-07T20:32:44.1627210Z Entering 'external/googletest' 2025-05-07T20:32:44.1694918Z Entering 'external/hipify_torch' 2025-05-07T20:32:44.1761810Z Entering 'external/json' 2025-05-07T20:32:44.1851733Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:44.1878784Z http.https://github.com/.extraheader 2025-05-07T20:32:44.1892231Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:44.1923122Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:44.2253595Z Entering 'external/asmjit' 2025-05-07T20:32:44.2296004Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2340588Z Entering 'external/composable_kernel' 2025-05-07T20:32:44.2384186Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2433920Z Entering 'external/cpuinfo' 2025-05-07T20:32:44.2476735Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2520429Z Entering 'external/cutlass' 2025-05-07T20:32:44.2565865Z http.https://github.com/.extraheader 2025-05-07T20:32:44.2616669Z 
2025-05-07T20:32:44.2616669Z Entering 'external/googletest'
2025-05-07T20:32:44.2661412Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2704133Z Entering 'external/hipify_torch'
2025-05-07T20:32:44.2748213Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2789949Z Entering 'external/json'
2025-05-07T20:32:44.2837306Z http.https://github.com/.extraheader
2025-05-07T20:32:44.2998426Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:32:44.3030343Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:32:44.3041450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:32:44.3041809Z ##[endgroup]
2025-05-07T20:32:44.3143440Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:32:55.2834042Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:11.8890837Z Cleaning up orphan processes
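Returning to the [EXEC] lines above: the harness re-runs pytest with --lf --last-failed-no-failures none, so each retry executes only the previously failed tests, and it aborts once the retry budget ("2 + 1 attempts") is exhausted. A rough Python sketch of that retry shape (the real harness is a shell script; the function and its flag handling are ours):

    import subprocess

    CMD = [
        "conda", "run", "python", "-m", "pytest", "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--lf", "--last-failed-no-failures", "none",
        "./moe/activation_test.py",
    ]

    def run_with_retries(retries: int = 2) -> int:
        # Initial run plus `retries` re-runs of only the last-failed tests.
        rc = 1
        for attempt in range(1, retries + 1):
            rc = subprocess.run(CMD).returncode
            if rc == 0:
                return 0
            print(f"[EXEC] [ATTEMPT {attempt}/{retries}] Command attempt failed.")
        return rc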